AFOSR-TR-78-0005


Approved for public release;
distribution unlimited.


Word Hypothesization
for Large-Vocabulary Speech Understanding Systems


DEPARTMENT 

of 


COMPUTER  SCIENCE 


Carnegie-Mellon University


Word  Hypothesization 

for  Large-Vocabulary  Speech  Understanding  Systems 


A.  Richard  Smith 
October  20,  1977 


Department  of  Computer  Science 
Carnegie-Mellon  University 
Pittsburgh,  PA,  15213 


Submitted  to  Carnegie-Mellon  University  in  partial  fulfillment  of  the  requirements  for  the 
degree  of  Doctor  of  Philosophy  in  Computer  Science. 

This work is supported in part by the Defense Advanced Research Projects Agency
under  contract  number  F44620-73-C-0074  and  is  monitored  by  the  Air  Force  Office  of 
Scientific  Research. 

AIR FORCE OFFICE OF SCIENTIFIC RESEARCH (AFSC)

NOTICE OF TRANSMITTAL TO DDC

This technical report has been reviewed and is
approved for public release IAW AFR 190-12 (7b).
Distribution is unlimited.

A. D. BLOSE

Technical Information Officer


Abstract 


This  thesis  describes  research  directed  toward  the  development  of  general 
English speech understanding systems. The relatively unconstrained grammars and
large  vocabularies  characterizing  such  systems  require  them  to  eliminate  most  of  the 
words  found  in  their  vocabularies  by  using  only  acoustic  information.  In  particular,  we 
present  the  design  and  performance  of  a bottom-up  word  hypothesizer  capable  of 
handling  large  vocabularies  (>10,000  words)  which  takes  segmented  and  labeled  speech 
as  input  and  produces  word  hypotheses.  The  primary  concerns  of  the  thesis  are  the 
problems  involved  with  large  vocabularies  and  the  effect  of  large  vocabularies  on  word 
hypothesization. 

The thesis deals with the following problems: 1) Knowledge Representation:
storing the acoustic knowledge of words efficiently for fast retrieval; 2) Knowledge
Acquisition: obtaining the acoustic knowledge for a large number of words easily; 3)
Flexibility: permitting improvements to be made to the acoustic processors of the
speech system (e.g., segmenter-labeler) without requiring an expensive reacquisition of
knowledge; and 4) Performance: hypothesizing many of the correct words of an
utterance and few incorrect ones within "reasonable" computation constraints.

The  solutions  to  these  problems  center  around  the  knowledge  representation 
used  by  the  word  hypothesizer.  Speech  is  represented  in  a hierarchy  of  levels 
containing (from bottom to top): segment labels, sylparts, syllables, and words.
(Sylparts  include  onsets  (the  initial  non-nucleus  part  of  a syllable),  vowels  and  codas 
(the  final  non-nucleus  part  of  a syllable).)  Knowledge  is  stored  in  a hierarchy-tree 
representation.  That  is,  between  each  pair  of  adjacent  levels  (segment-sylpart,  sylpart- 
syllable,  and  syllable-word  level  pairs)  is  a tree  structure  storing  a sequence  of  lower 
level  units  to  define  a higher  level  unit.  The  tree  between  each  pair  of  levels  permits 
merging  common  initial  parts  of  sequences  to  reduce  storage  costs  and  recognition  time. 
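
As a rough illustration of the merging described above, the following Python sketch builds one such tree between two adjacent levels. It is a minimal sketch under stated assumptions, not the implementation used in the thesis: the names Node, build_tree, count_nodes, and lookup are hypothetical, and the syllable spellings in the example are invented for illustration only.

```python
class Node:
    """One node of a prefix tree; each edge is labeled with a lower-level unit."""

    def __init__(self):
        self.children = {}  # lower-level unit -> Node
        self.defines = []   # higher-level units whose sequence ends at this node


def build_tree(definitions):
    """Build a tree from a dict mapping each higher-level unit to its
    sequence of lower-level units. Common initial parts share nodes."""
    root = Node()
    for unit, sequence in definitions.items():
        node = root
        for lower in sequence:
            node = node.children.setdefault(lower, Node())
        node.defines.append(unit)
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def lookup(root, sequence):
    """Return the higher-level units defined by exactly this sequence."""
    node = root
    for lower in sequence:
        node = node.children.get(lower)
        if node is None:
            return []
    return node.defines


# Hypothetical syllable-to-word entries (assumed spellings, not from the thesis).
words = {
    "under": ["un", "der"],
    "understand": ["un", "der", "stand"],
    "understanding": ["un", "der", "stand", "ing"],
    "unit": ["u", "nit"],
}
tree = build_tree(words)
print(count_nodes(tree), sum(len(s) for s in words.values()))
```

In this toy case the four sequences stored separately occupy 11 units, while the merged tree needs 6 nodes, because "under", "understand", and "understanding" share their initial syllables. A lookup also walks the shared part only once, which is the sense in which merging can reduce recognition time as well as storage.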

The  solution  to  the  problem  of  knowledge  acquisition  is  to  separate  the  acoustic 
description  of  words  into  a)  a priori  knowledge:  base  pronunciations  of  words  acquired 
from  a word-phoneme  dictionary  and  stored  in  the  two  higher  level  trees  (the  sylpart- 
syllable  and  syllable-word  trees)  and  b)  learned  knowledge:  segment-label  patterns  of 
the  sylparts  acquired  by  training  the  hypothesizer  on  the  output  of  a particular 
segmenter-labeler  and  stored  in  the  lowest  level  tree  (the  segment-sylpart  tree).  This 
solution  is  made  possible  by  several  methods  of  handling  at  the  lowest  level  the 
coarticulation  problems  common  in  continuous  speech.  One  method  is  the  ability  to  learn 
a vowel-sequence, which may occur when more than one syllable shares the same
syllable nucleus. A second method is context-learning, which involves learning the
surrounding segment-context of a segment-pattern in order to account for variations in
the  segment-patterns  learned  for  a sylpart. 

We  present  several  measures  in  order  to  evaluate  the  storage  and  recognition 
cost  efficiency  of  the  representation,  analyze  the  recognition  algorithm,  and  evaluate 
the  performance  of  the  word  hypothesizer  over  different  vocabulary  sizes. 


The word hypothesizer is tested on 105 utterances (705 words) for 7 different
vocabulary sizes ranging from 500 words to 19,000 words. The performance for these
vocabularies ranges from a word accuracy of 73% at an average rank of 2.6 for the
correct hypotheses using a 500-word vocabulary to a word accuracy of 58% at an
average rank of 5.8 using a 19,000-word vocabulary. According to the average
efficiency measure (developed here), this performance degrades at approximately a
logarithmic rate over the range of vocabulary sizes tested. The computation costs begin
at 2.4 MIPSS (millions of instructions per second of speech) for the 500-word vocabulary
and increase at a logarithmic rate to 6.6 MIPSS for the 19,000-word vocabulary.

We conclude that bottom-up word hypothesization is not greatly affected by the
size  of  the  vocabulary  and  that  with  improvements  in  the  word  hypothesizer  and  the 
segmenter-labeler,  speech  understanding  systems  for  general  English  can  obtain  a great 
amount  of  constraint  from  the  acoustics  alone. 

The  major  contributions  are:  1)  A better  understanding  of  the  effect  of  large 
vocabularies  for  speech  understanding  systems,  2)  A solution  to  the  problem  of 
knowledge acquisition for an AI knowledge-based system, 3) Several methods for
handling  at  low  representation  levels  of  speech  some  of  the  coarticulation  problems  of 
continuous  speech,  and  4)  The  design  of  a bottom-up  word  hypothesizer  that  performs 
better  than  earlier  word  hypothesizers. 




Acknowledgments 


I would  like  to  acknowledge  the  aid  and  support  of  several  people  who  in  many 
ways  made  the  completion  of  these  graduate  years  possible.  First,  I would  thank  my 
advisor, colleague, and friend, Lee Erman. Combined in one person are all the ideal
qualities of an advisor: excellent advice, interest in the student and project, availability,
and editing skills. I want to thank Raj Reddy for his guidance in my earlier graduate
years and for attracting me to speech understanding research. My thanks also go to
the  other  members  of  my  committee,  Jon  Bentley  and  Ron  Cole,  for  their  help  in 
improving  the  quality  and  readability  of  the  thesis. 

At  various  stages  of  the  work,  I received  the  competent  assistance  of  Jim 
Berenbaum,  Rajiv  Bhalla,  and  Eric  Grant.  They  professionally  did  what  was  often  tedious 
work.  The  environment  of  the  department  contributed  in  the  form  of  excellent  facilities 
and  a pool  of  willing  experts. 

I received much needed moral support from several people. I want to
acknowledge a fellow graduate, Dale Partin, whose "every Monday lunch meetings"
helped me to keep a proper perspective on work and life. I would thank my parents for
their years of encouragement and prayers. The general interest and expressed concern
of many others: friends, relatives, and office-mates, was appreciated. But above all, I
publicly  acknowledge  the  encouragement,  sympathy,  exhortation,  and  sustaining 
confidence  of  my  wife  Penny.  She  has  successfully  brought  a husband  through  a thesis 
and  a baby  through  the  first  eight  months  of  life. 

Finally, I give ultimate credit to "the One perfect in knowledge" [Job 36:4], soli
Deo gloria.




Contents 


Abstract

Acknowledgments

1. Introduction
   1.1 Motivation
   1.2 The Problem
       1.2.1 Background of Speech Recognition and Understanding Research
       1.2.2 Top-Down versus Bottom-Up Word Hypothesization
       1.2.3 Large Vocabularies
       1.2.4 Human Performance
       1.2.5 Summary
   1.3 Previous Research
       1.3.1 Design of POMOW
       1.3.2 Results for POMOW
       1.3.3 Limitations of POMOW
       1.3.4 Conclusions for POMOW
   1.4 Overview of Noah
   1.5 Organization of Chapters
   1.6 Hints to the Reader

2. Structures and Measures
   2.1 Introduction
   2.2 Knowledge Representation Structures
   2.3 Measures of Storage and Recognition Costs
   2.4 The Confusion of Hypotheses
       2.4.1 Segmenter-Labeler Hypotheses
       2.4.2 A Measure of the Confusion of Hypotheses

3. Representation of Knowledge
   3.1 Introduction
   3.2 An Example
   3.3 Levels of Speech Representation
   3.4 Auxiliary Information
   3.5 Application of Storage and Recognition Measures
   3.6 Storage Costs
   3.7 Other Knowledge Representations for Speech
       3.7.1 Tree
       3.7.2 Network
       3.7.3 Transition Network
       3.7.4 ACORN

4. Acquisition of Knowledge
   4.1 Introduction
   4.2 Dictionary Knowledge
   4.3 Segment-Label Knowledge
       4.3.2 Segment Pattern Learning
       4.3.3 Segment Pattern Learning for Vowels
             4.3.3.1 Syllable Nuclei
             4.3.3.2 Nonsequential Pattern Storage
             4.3.3.3 Vowel Sequence Learning
       4.3.4 Context Learning
       4.3.5 Hand-made Segment Patterns

5. Recognition
   5.1 Introduction
   5.2 The Recognition Algorithm
       5.2.1 Information Needed for Recognition
       5.2.2 One Step in Recognition
       5.2.3 Features of Recognition Unique to the Lower Levels
             5.2.3.1 Segment Level to Sylpart Level
             5.2.3.2 Sylpart Level to Syllable Level
       5.2.4 Parallel Recognition
   5.3 Rating of Hypotheses
   5.4 Propagation of Segment Label Confusion During Recognition

6. Results and Analysis
   6.1 Introduction
       6.1.1 Measurements of Performance
             6.1.1.1 Word Accuracy and Average Rank
             6.1.1.2 Average Efficiency
             6.1.1.3 Summary of Performance Measures
       6.1.2 Training and Testing Conditions
   6.2 Performance and Runtime Characteristics
       6.2.1 Performance versus Word Vocabulary Size
       6.2.2 Performance versus Training Sample Size
       6.2.3 Computation Costs versus Vocabulary Size
       6.2.4 Breakdown of Storage Costs
   6.3 Analysis
       6.3.1 Effect of Vocabulary Size on Performance
       6.3.2 Effect of Training on Performance
       6.3.3 Error Analysis for Sylpart Recognition
       6.3.4 Effect of Word Length on Word Accuracy
       6.3.5 Word Training versus Sylpart Training
       6.3.6 What Words should be Hypothesized?
   6.4 Comparison with other Word Hypothesizers
       6.4.1 POMOW-Wizard
       6.4.2 Lexical Retrieval Component in the HWIM System

7. Summary and Conclusions
   7.1 Summary
       7.1.1 Performance
       7.1.2 Runtime Characteristics
   7.2 Conclusions
   7.3 Contributions
   7.4 Other Applications
       7.4.1 Analysis of the Word Sound Similarity Space
       7.4.2 Image Understanding
   7.5 Future Research
       7.5.1 Suggested Improvements for Noah
             7.5.1.1 Tuning the System
             7.5.1.2 Selective Training
             7.5.1.3 Additional Information
       7.5.2 Speech System Integration for Noah
             7.5.2.1 Performance within a Particular Speech System
             7.5.2.2 System Control
       7.5.3 Great Expectations

References

Appendix A: "ARPABET" Computer Phonetic Representation

Appendix B: Lexicons

Appendix C: Schwa Deletion Rules

Appendix D: Training and Testing Utterances


1.1  Motivation 


Speech  Understanding  Systems  to  date  have  depended  on  a very  constrained 
syntax and semantics model of language and a small (≈1000) word vocabulary in order
to  handle  the  problem  of  understanding  continuous  speech.  Though  we  cannot  claim 
that  such  systems  have  solved  the  continuous  speech  problem  even  for  such 
constrained  tasks,  we  should  begin  to  look  ahead  towards  speech  systems  capable  of 
handling  general  English  with  its  weakly  constrained  syntax  and  semantics  and  its 
relatively  unlimited  vocabulary.  One  of  the  characteristics  of  these  more  ambitious 
systems  will  be  their  ability  to  eliminate  most  of  the  words  found  in  their  vocabularies 
as  possible  candidates  for  what  was  spoken  by  using  only  the  acoustic  information.  It  is 
the purpose of this thesis to investigate to what degree a word hypothesizer can use
the  acoustic  information  of  the  utterance  to  reduce  the  search  space  of  possible  word 
sequences  for  such  speech  systems.  The  primary  concern  will  be  the  effect  that  large 
vocabularies  have  on  the  performance  of  a word  hypothesizer.  The  investigation  is 
carried  out  by  designing  and  implementing  a word  hypothesizer  capable  of  handling 
large  vocabularies  in  order  to  study  the  effect  of  vocabulary  size. 

The  work  for  this  thesis  actually  includes  the  design  and  implementation  of  two 
word hypothesizers: 1) the POMOW word hypothesizer [Smith - 1976] used in the
Hearsay-II speech understanding system, which was demonstrated on September 8,
1976, at Carnegie-Mellon University [CMU Computer Science Speech Group - 1977] and
2) the Noah¹ word hypothesizer that developed out of POMOW and which was designed
specifically  with  large  vocabularies  in  mind.  POMOW  will  be  considered  here  as  previous 
research  and  will  be  looked  at  only  to  understand  some  of  the  problems  which  had  to 
be  handled  by  Noah.  This  chapter  discusses  the  word  hypothesization  problem,  looks  at 
previous  research,  and  gives  an  overview  of  Noah  and  the  remaining  chapters. 

1 For  those  interested  in  name  derivations,  "POMOW"  was  derived  from  "Phones 
hypothesize  Morphemes  (syllables)  hypothesize  Words".  "Noah"  was  named  for  Noah 
Webster,  an  early  American  lexicographer. 




1.2  The  Problem 

The  nature  of  speech  is  such  that  there  is  no  direct  mapping  from  acoustic 
information  to  a unique  spoken  word.  The  acoustic  pattern  of  a word  is  embedded 
within  the  total  pattern  of  the  utterance  and  modified  by  it.  This  is  called  the 
coarticulation  problem.  A listener  interprets  an  acoustic  event  not  only  by  what 
actually  occurs  in  the  utterance  but  also  by  the  surrounding  context  and  even  by  what 
he expects to hear. Environmental noise, differences between speakers, differences for
the  same  speaker  at  different  times,  and  variations  in  pronunciations  also  add  to  the 
difficulty  of  finding  what  words  were  spoken  in  an  utterance.  Another  problem  is 
carelessness  by  the  speaker;  it  seems  that  a person  often  speaks  just  well  enough  to  be 
understood  (most  of  the  time)  by  another  human  [Newell  - 1975]. 

1.2.1 Background of Speech Recognition and Understanding Research

The  history  of  research  toward  a solution  of  the  above  problems  is  a history  of 
a step  by  step  relaxing  of  constraints  on  the  problem  as  progress  was  made.  The  work 
of Reddy and Reddy & Vicens at Stanford University ([Reddy - 1966]; [Reddy and Vicens
- 1968]; [Vicens - 1969]) resulted in extending the state-of-the-art of isolated word
recognition systems (e.g., 91% accuracy on a 561-word vocabulary in ten times real-time
on a PDP-10 and with live input). Although the Vicens-Reddy system increased the
vocabulary  size  by  an  order  of  magnitude  over  that  permitted  by  former  word 
recognition  systems,  coarticulation  problems  were  avoided  by  requiring  a user  to  speak 
the  words  of  a sentence  separated  by  short  silences.  Important  differences  between 
that  system  and  earlier  ones  were  that  it  contained  a substantial  amount  of  speech 
knowledge and it used extensive heuristics in applying the knowledge to prune the
search  space.  Early  word  recognition  systems  were  essentially  pattern  classifiers. 

The Hearsay-I model of speech understanding, developed at CMU during 1970-1971
([Reddy, Erman, and Neely - 1972]), faced the problems of connected speech. The
implementation of this model as the Hearsay-I system [Reddy, Erman, and Neely - 1973]
resulted in the first demonstrable (June, 1972) live system to handle non-trivial
connected speech. In order to handle connected speech and obtain a sentence and
word accuracy of 79% and 93% respectively, the system depended on a very
constrained syntax and semantic model of speech (e.g., the chess task) and a very small
vocabulary (<40 words). The system served to clarify the nature and necessary
interaction of several sources of knowledge by using three independent cooperating
sources of knowledge (acoustic-phonetic, syntactic, and semantic).

Concurrent with the development of the Hearsay-I model, a group was formed
by the Advanced Research Projects Agency (ARPA) to study the feasibility of
developing speech understanding systems. The resulting report [Newell, et al. - 1971]
gives a comprehensive and detailed analysis of the problems involved and specifies
reasonable constraints for a five year research effort on the problem. Notable among
these constraints are: [The system should] "accept connected speech ..., permitting a
slightly selected vocabulary of 1000 words, with a highly artificial syntax, and a task
with a constrained and fairly simple semantics, tolerating less than 10% semantic
error, in a few times real-time, ...".

On  recommendation  of  the  study  group  a five  year  ARPA  Speech  Understanding 
Research  effort  began  in  October,  1971.  Of  the  speech  systems  demonstrated  in  1976, 
the Harpy system [Lowerre - 1976] succeeded in meeting the specifications of the
project; the Hearsay-II system came close.

Common  to  all  of  the  speech  systems  resulting  from  the  five-year  effort  (several 
of  which  are  mentioned  below)  is  the  use  of  the  hypothesize-and-test  paradigm.  That 
is,  these  systems  attempt  to  solve  the  speech  understanding  problem  by  an  iterative 
process  of  a)  creating  hypotheses,  "educated  guesses"  about  some  aspect  of  the 
problem,  and  b)  testing  the  plausibility  of  the  hypotheses.  An  important  part  of  these 
systems  is  word  "guessing",  i.e.,  word  hypothesizing. 

1.2.2 Top-Down versus Bottom-Up Word Hypothesization

Faced  with  the  problem  of  finding  the  acoustic  pattern  of  a word  embedded 
within  the  total  pattern  of  the  utterance,  researchers  have  often  taken  the  approach  of 
avoiding any word hypothesization from the lower acoustic levels². Rather, all
hypothesization is based on syntactic constraints from higher levels. Word verification
(the test part of the hypothesize-and-test paradigm) is then performed by generating a
low-level representation (such as phones³, acoustic segments, or even a spectrogram)
for  each  hypothesized  word  and  then  matching  this  representation  against  a part  of  the 
actual  input  to  derive  a score.  The  word  with  the  best  score  is  taken  to  be  the  spoken 
word  for  that  part  of  the  utterance. 

One  method  of  doing  top-down  word-hypothesization  / word-verification  is 
found in Hearsay-I. The system works left-to-right (or right-to-left) through the
complete  utterance.  All  words  that  can  legally  begin  (end)  a sentence  are  hypothesized 
at  the  beginning  (ending).  Those  words  rated  well  by  the  word  verification  step  give 


2 Speech  is  often  viewed  by  systems  as  having  different  representations  in  a hierarchy 
of  levels  beginning  at  the  bottom  with  the  speech  waveform,  continuing  up  through 
levels  such  as  parametric,  phonetic,  syllabic,  lexical  (or  word),  and  ending  in  a phrase 
or  semantic  level  at  the  top. 

3 We will use "phone" to refer to a sound detected and classified by a program and
"phoneme" for the expectation of that sound as entered in a word pronunciation
dictionary.



syntactic  and  semantic  constraint  for  hypothesizing  the  following  (preceding)  adjacent 
word.  This  process  is  continued  across  the  utterance  until  the  end  (beginning)  is 
reached. 
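
The left-to-right procedure just described can be sketched in a few lines of Python. This is only a toy instance of the hypothesize-and-test idea under stated assumptions, not the Hearsay-I algorithm itself: the grammar, the pronunciations, the phone symbols, the scoring rule in verify, and the threshold are all hypothetical choices made for illustration.

```python
# Hypothetical toy grammar: which words may follow which (assumed, chess-like).
GRAMMAR = {
    "<start>": ["pawn", "bishop"],
    "pawn": ["takes", "<end>"],
    "bishop": ["takes", "<end>"],
    "takes": ["pawn", "bishop"],
}

# Hypothetical pronunciations as phone sequences (assumed for illustration).
PRONUNCIATIONS = {
    "pawn": ["P", "AO", "N"],
    "bishop": ["B", "IH", "SH", "AX", "P"],
    "takes": ["T", "EY", "K", "S"],
}


def verify(word, observed, position):
    """Toy verifier: fraction of the word's expected phones that agree
    with the observed phones starting at `position`."""
    expected = PRONUNCIATIONS[word]
    window = observed[position:position + len(expected)]
    if len(window) < len(expected):
        return 0.0
    hits = sum(e == o for e, o in zip(expected, window))
    return hits / len(expected)


def top_down(observed, threshold=0.6):
    """Hypothesize only grammar-legal words at each point, keep those the
    verifier rates well, and extend them until the utterance is covered."""
    paths = [([], 0, 1.0)]  # (words so far, input position, score)
    complete = []
    while paths:
        words, pos, score = paths.pop()
        previous = words[-1] if words else "<start>"
        for word in GRAMMAR[previous]:
            if word == "<end>":
                if pos == len(observed):
                    complete.append((score, words))
                continue
            rating = verify(word, observed, pos)
            if rating >= threshold:
                next_pos = pos + len(PRONUNCIATIONS[word])
                paths.append((words + [word], next_pos, score * rating))
    return max(complete)[1] if complete else None


# "bishop takes pawn" with the final phone of "bishop" mislabeled as "B".
OBSERVED = ["B", "IH", "SH", "AX", "B", "T", "EY", "K", "S", "P", "AO", "N"]
print(top_down(OBSERVED))
```

Every grammar-legal word is matched at every point reached, whether or not the acoustics support it. With a two-word choice this costs little, but the same loop over a weakly constrained grammar and a large vocabulary would spend most of its time rejecting words.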

Such top-down methods of word hypothesization (also called analysis-by-synthesis)
are potentially more accurate than the alternative bottom-up hypothesization
from the acoustics. This is because the transformation and match take place for each
word  with  knowledge  about  the  possible  context  around  the  word  and  with  knowledge 
about  how  various  acoustic  events  within  the  word  might  interact  to  produce  the 
speech.  Linguists  have  developed  a fairly  detailed  generative  model  of  speech 
describing  these  transformations  from  words  to  the  speech  signal.  The  problem  with 
top-down  methods  is  that  they  are  slow  if  many  words  must  be  matched  in  each  place  in 
the  utterance;  systems  waste  time  matching  words  which  may  be  syntactically  correct 
but  have  very  little  acoustic  support.  As  the  semantic  and  syntactic  speech  model 
becomes  more  general,  top-down  systems  are  swamped  by  the  number  of  words 
hypothesized.  Hearsay-I  [Reddy,  Erman,  and  Neely  - 1973],  the  Lincoln  Lab  Speech 
System  [Forgie  - 1974],  the  Harpy  Speech  System  [Lowerre  - 1976]  and  the  IBM 
Speech  Recognition  System  [Bahl,  et  al.  - 1976]  are  examples  of  systems  using  top- 
down  methods. 

The  method  of  bottom-up  word  hypothesization  attempts  to  infer  from  the 
acoustic  information  what  subset  of  words  from  the  vocabulary  may  have  been  spoken 
in  each  part  of  the  utterance.  These  hypotheses  may  then  be  verified  as  in  the  top- 
down  methods. 
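
By contrast, a bottom-up hypothesizer starts from the labels actually found in the utterance. The following Python sketch shows the idea in its simplest form. It assumes exact matching of a single stored pattern per word, which is a deliberate simplification; the function name bottom_up, the label symbols, and the patterns are hypothetical and are not taken from POMOW or Noah.

```python
def bottom_up(labels, patterns):
    """Return (word, start, end) for every stored pattern that occurs in the
    label sequence, with no reference to syntax or semantics."""
    hypotheses = []
    for start in range(len(labels)):
        for word, pattern in patterns.items():
            end = start + len(pattern)
            if labels[start:end] == pattern:
                hypotheses.append((word, start, end))
    return hypotheses


# Hypothetical vocabulary patterns (assumed for illustration).
PATTERNS = {
    "pawn": ["P", "AO", "N"],
    "bishop": ["B", "IH", "SH", "AX", "P"],
    "takes": ["T", "EY", "K", "S"],
    "take": ["T", "EY", "K"],
}

LABELS = ["B", "IH", "SH", "AX", "P", "T", "EY", "K", "S", "P", "AO", "N"]
hyps = bottom_up(LABELS, PATTERNS)
print(hyps)
```

Note that both "takes" and "take" are hypothesized over the same region. The output is a set of candidates for each part of the utterance rather than a single answer, which is why a verification step may still follow.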

The amount of acoustic information used for word hypothesization⁴ varies. One
rarely-used version of Hearsay-I tried to use gross acoustic events, such as the "SH" in
"bishop", to suggest possible words in a region. We know of only two examples of word
hypothesizers using extensive acoustic information. They are Klovstad's "Probabilistic
Lexical Retrieval Component" for the BBN (Bolt, Beranek and Newman Inc.) HWIM system
[Klovstad - 1976] and POMOW [Smith - 1976], the word hypothesizer knowledge source
in Hearsay-II.

Although  the  HWIM  system  [Woods,  et  al.  - 1976]  is  oriented  towards  a top- 
down  approach  of  speech  recognition,  in  one  mode  of  operation  of  the  system  the 
lexical  retrieval  component  does  an  initial  scan  of  the  utterance  to  find  the  best  N words 
(N>15  currently)  which  fit  in  the  utterance.  Assuming  that  the  highest  rated  words  are 
correct,  the  system  tries  to  match  words  which  are  syntactically  consistent  to  the  left 
and/or  to  the  right  of  each  hypothesized  word.  Continuing  in  this  way,  the  grammar 


4 In the remainder of the thesis, the term "word hypothesization" by itself will be used
to mean bottom-up word hypothesization.



controls what words are matched adjacent to each highly rated word until the utterance
is covered. The lexical retrieval component is driven by the possible phone sequences
hypothesized for the utterance. For each partial sequence of phones, the probability
for each of the best words matching the sequence is computed. The best performance
of the HWIM system, however, is not obtained using that mode, but rather with a
left-to-right, top-down control (i.e., similar to the approach used by Hearsay-I).

The POMOW word hypothesizer uses segment-label sequences to hypothesize
many words throughout the utterance for the Hearsay-II speech system. The segment
level description of speech is obtained by segmenting the acoustic description into
similar parts (e.g., silence-like, nasal-like, noise-like, etc.). Each part is characterized by
assigning it a label (hence, segment-labels). The knowledge source that does this work
is called the segmenter-labeler. In general, the segment level is a more detailed
description of speech than the phone level (thus, conceptually lower than the phone
level). For example, the phone "T" is often made up of a silence-like segment followed by
an aspiration-like segment (i.e., a short release of breath).

After words are hypothesized by POMOW, the word verifier (Wizard [McKeown -
1977]) then makes a more careful match of the words to the acoustics before another
knowledge source searches for sequences of words meeting the syntactic constraints of
the grammar. The system continues in a top-down mode by appending words to the
ends of these sequences. The best-rated sequence of words forming a sentence in the
grammar is the recognized utterance.

Hearsay-II is very sensitive to the performance of the word hypothesizer.
When 33 utterances were put into three groups according to the accuracy of the word
hypothesizer (about 67 utterance words per group), the performance of the system given
in Figure 1.1 was found. The sentence error rate is the percent of sentences for which
the system missed at least one word. The processing time does not include the time for
bottom-up word hypothesization. It is clear that the performance of the word
hypothesizer greatly affects the accuracy and speed of the speech system.

1.2.3 Large Vocabularies

Most of the problems of large vocabularies are typical of anything large in
computer science. The investigation of these problems for a word hypothesizer is the
central topic of this dissertation:

Storage requirements must be controlled and computation costs
kept down as the vocabulary increases.

The problem of performance degradation due to larger
vocabularies must be handled. As more words are added to a
vocabulary, the more likely it is that words will be confused with one
another. Sometimes this is because one word is a subpart of another,
such as "Plea" being confused with "Please". Other times the confusion
comes because words have similar acoustic descriptions and the word
hypothesizer (and/or the segmenter-labeler) cannot distinguish between
the acoustic patterns. An example of this is the words "What" and
"Watt".

[Figure 1.1: Performance of Hearsay-II versus Word Hypothesization Accuracy; axis: Time (seconds)]

A potentially  crippling  problem  is  knowledge  acquisition.  This  is 
a common  and  important  problem  for  AI  knowledge-based  systems 
[Feigenbaum  - 1977].  For  each  word  to  be  recognized  by  the  word 
hypothesizer,  a description  of  how  it  will  appear  in  speech  is  needed. 
This  acoustic  description  must  somehow  include  all  the  variations  which 
the  word  will  undergo  due  to  the  problems  mentioned  above 
(coarticulation,  pronunciation  variations,  etc.).  One  can  acquire  the 
acoustic  descriptions  for  1000  words  by  hand,  but  the  amount  of  work 
required  becomes  increasingly  prohibitive  for  10,000  words  or  100,000 
words.  Thus,  a necessary  goal  for  the  Noah  word  hypothesizer  is  the 
ability  to  add  words  to  its  vocabulary  easily. 

Related to the problem of knowledge acquisition is the problem of
flexibility. If the acoustic descriptions are tied to a particular
segmenter-labeler, major effort must be spent to take advantage of an
improved segmenter-labeler. A goal for Noah is the ability to change to
a new (and better) segmenter-labeler with little effort expended.

1.2.4  Human  Performance 

Although it is hard to constrain a human to the role of a bottom-up word
hypothesizer, some measure of human performance is found in the work of Miller and
Isard [Miller & Isard - 1963]. Part of their work involved testing the ability of
subjects to recognize (i.e., repeat back) the words of ungrammatical "utterances" spoken
to them (e.g., "The built a was tamer fortune blaze by lazy"). Thus, the syntactic and
semantic constraints were removed, forcing the subjects to recognize the words from
the acoustic information alone. In a test of 50 utterances (5 to 9 words each), 56.1% of
the utterances were repeated back exactly (i.e., all words recognized) and 88.3% of the
principal words (i.e., not function words like "the" and "a") were recognized. The
accuracy for function words was found to be lower (however, those numbers are not
available). Despite a preliminary training period for the task, the subjects improved
significantly during the experiment. The sentence accuracy for the first 10 sentences
was 35.7%; the accuracy for the last 10 sentences was 62.1%. (Word accuracy was not
given.)

Although the experiment does not indicate how much bottom-up word
hypothesization is necessary for human speech understanding, it does indicate that
humans can do better bottom-up word hypothesization than machines can. Humans are
able to use the acoustic information to a great degree to constrain the interpretations
of speech.

1.2.5 Summary

As  the  constraints  of  vocabulary  size,  syntax,  and  semantics  are  relaxed  for  a 
speech  understanding  system,  the  problems  of  speech  (e.g.,  coarticulation,  noise, 
pronunciation  variation,  etc.)  require  a greater  dependency  by  the  system  on  the 
acoustic  constraints  present  in  speech.  A bottom-up  word  hypothesizer  is  a component 
of  the  speech  system  which  uses  low  level  acoustic  information  (generally,  phone  or 
segment-label  hypotheses)  and  outputs  word  hypotheses  in  order  to  constrain  the 
possible  interpretations  of  an  utterance.  It  has  been  found  that  humans  have  the  ability 
to  use  to  a great  degree  (relative  to  current  speech  systems)  the  acoustic  information 
alone,  in  order  to  recognize  words  in  an  utterance. 

The problems of word hypothesization for a large vocabulary (and to some
degree the goals and contributions of the thesis) are: 1) Knowledge Representation:
storing the acoustic knowledge of words efficiently for fast retrieval; 2) Performance:
hypothesizing many of the correct words and few incorrect ones within "reasonable"
computation constraints; 3) Knowledge Acquisition: obtaining the acoustic knowledge for
a large number of words easily; and 4) Flexibility: permitting improvements to be made
to the acoustic processors of the system (e.g., the segmenter-labeler) without requiring an
expensive reacquisition of knowledge.


1.3  Previous  Research 

As has been mentioned, existing bottom-up word hypothesizers are confined to
the lexical retrieval component of the HWIM system and the POMOW word hypothesizer
of Hearsay-II. The structure and performance of the HWIM hypothesizer will be
compared to Noah later in the thesis (see Section 6.4.2); this section is limited to a brief
description of the design, performance, and limitations of POMOW.

1.3.1 Design of POMOW

POMOW introduces an intermediate level between the segment level and the
words. At this new level, classes of syllables called syltypes are hypothesized based on
segment-label hypotheses. Syltypes are related to the underlying sequence of
segment-labels by using a Markov probability model[5]. Then, for each syltype
hypothesized, all words containing a syllable which is a member of that syltype class
are suggested for hypothesization. Multisyllabic words which match poorly against
adjacent syltype hypotheses are pruned. We discuss below the definition of syltypes,
how the Markov probability model relates them to a sequence of segment-labels, and
how words are hypothesized from the syltypes.

Figure 1.2 gives a sample from a Hearsay-II word-phoneme dictionary[6]. The
syntax of the dictionary permits an AND/OR tree of the possible phonemes in a word.
Parentheses and commas indicate an OR group, and concatenation of the elements
(phonemes or syllables) indicates an AND group. Angle brackets, "<>", are syllable
boundaries, with the number after the opening bracket giving the stress level of the
syllable. (We use 0, 1, and 2 to indicate reduced, normal, and stressed, respectively,
with a default stress of normal.) A "*" in any OR group (whether composed of
phonemes or syllables) indicates that the group may be absent. Many phonological
rules have been put into this dictionary as alternative pronunciations. For example,
"about" is defined (in Figure 1.2) to start with an optional syllable which, if present, is a
reduced schwa (AX); it concludes with a stressed syllable made up of the three
phonemes "B", "AW", and "T".
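The AND/OR reading of a dictionary entry can be illustrated with a short sketch. The nested-tuple encoding, the function name `expand`, and the use of `None` for the absent alternative are assumptions made for illustration; they are not the format or code of the Hearsay-II dictionary software, and stress levels and syllable boundaries are left out.

```python
from itertools import product

def expand(node):
    """Return every phoneme sequence allowed by an AND/OR tree."""
    if node is None:              # the "absent" alternative of an OR group
        return [[]]
    if isinstance(node, str):     # a single phoneme
        return [[node]]
    kind, children = node
    if kind == "OR":              # any one child may be chosen
        return [seq for child in children for seq in expand(child)]
    # AND: concatenate one choice from each child, in order
    return [sum(parts, []) for parts in product(*(expand(c) for c in children))]

# "about": an optional reduced schwa syllable followed by B AW T
about = ("AND", [("OR", [("AND", ["AX"]), None]),
                 ("AND", ["B", "AW", "T"])])

print(expand(about))  # [['AX', 'B', 'AW', 'T'], ['B', 'AW', 'T']]
```

Under this encoding each alternative pronunciation written into an entry becomes one expanded sequence, which makes clear how quickly hand-entered phonological alternatives multiply the pronunciations of a word.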

5 The idea of using the Markov probability model came from the Dragon speech system
[Baker - 1975].

6 The phonemes are given in a two character Arpabet notation -- see Appendix A.




ABOUT        (<0AX>,*)<2B AW T>
ACUPUNCTURE  <2AE K><0Y (UW,AX)><1P AX NX K><0 (T,*)SH ER>
AGRICULTURE  <2AE G><0 R (IH,IX)><1K (AX,AA) L><T SH ER>
AIRPLANE     <2(EH,EY) R><P L EY N>
AIRPLANES    <2(EH,EY) R><P L EY N Z>
AKRON        <2AE><K R (AX N,EN)>
ALBERTA      <AE L><2B ER><(DX,T) AX>
ALCOHOL      <2AE L><0K AX><(HH,*) AA L>
ALL          <AO L>
ALLIGATOR    <2AE><0L (IH,IX)><1G EY><0 (DX,T)ER>
AMERICAN     <0(AX M,EM)><2(M,*)EH><0R (IH,IX)><K (IX N,EN)>
ANALYSIS     <0AX><2N AE><0L AX><S (IH,IX) S>
AND          (<0 EN>,<(AE,IX) N (*,D)>)
ANIMALS      <2AE><N (IH,AX,EM)><M (AX L,EL)Z>
ANY          <1(IH,EH)><0N (IY,IH)>
ARCARO       <AA R><0K EH><R (AX,OW)>


Figure 1.2: Sample from a Word-Phoneme Dictionary.


The definition of syltypes is based on grouping the phonemes into seven
classes: A-like vowels, I-like vowels, U-like vowels, liquids, nasals, stops, and fricatives.
Figure 1.3 gives the class membership for the phonemes. Each class contains two
states, depending on which side of the syllable nucleus the phoneme appears (e.g.,
phoneme "T" is mapped to a STOPILEFT state if it precedes the syllable nucleus, and to
a STOPIRIGHT state if it follows the nucleus). Vowels are also mapped to left and right
states. Typical state transitions are described by the network given in Figure 1.4. For
example, let the above phoneme classes be represented by the symbols A, I, U, L, N, P, and
F respectively (as in the first column of Figure 1.3). The word "AIRPLANES", with the
pronunciation <EH R> <P L EY N Z>, is mapped into the syltypes IL and PLINF. A unique
path in the network corresponds to each of these syltypes.
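The mapping from a pronunciation to its syltypes can be sketched as a table lookup per phoneme. The dictionary below lists only the phonemes needed for the "AIRPLANES" example, and the names `PHONEME_CLASS` and `syltype` are illustrative assumptions; the distinction between left and right states is ignored in this sketch.

```python
# Class codes for the phonemes used in the example (a subset of the classes
# described in the text: I-like vowels, liquids, stops, nasals, fricatives).
PHONEME_CLASS = {
    "EH": "I", "EY": "I",   # I-like vowels
    "R": "L", "L": "L",     # liquids
    "P": "P",               # stops
    "N": "N",               # nasals
    "Z": "F",               # fricatives
}

def syltype(syllable):
    """Map one syllable (a list of phonemes) to its string of class codes."""
    return "".join(PHONEME_CLASS[phoneme] for phoneme in syllable)

airplanes = [["EH", "R"], ["P", "L", "EY", "N", "Z"]]
print([syltype(s) for s in airplanes])  # ['IL', 'PLINF']
```

Because many phonemes share a class code, many different syllables collapse to the same syltype string; this is the source of the loss of discrimination discussed later for POMOW.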


CODE  SYLTYPE   PHONEME

A     A-LIKE:   AE,AA,AH,AO,AX
I     I-LIKE:   IY,IH,EY,EH,IX,AY
U     U-LIKE:   OW,UH,U,UW,ER,AW,OY,EL,EM,EN
L     LIQUID:   Y,W,R,L
N     NASAL:    M,N,NX
P     STOP:     P,T,K,B,D,G,DX
F     FRIC:     HH,F,TH,S,SH,V,DH,Z,ZH,CH,JH,WH

Figure 1.3: Phoneme Equivalence Class

The Markov probability model gives a way of calculating the probability of each
path through the network (i.e., each syltype) for a segment-label sequence. The
probabilities used in the model are inferred from training utterances. A training program
uses the phoneme equivalence classes to convert phonetically hand-labeled utterances[7]
into state sequences. Previous results from the Hearsay-II segmenter-labeler are
aligned with these states so that frequency counts of <current state, next state, next
segment-label> triples can be made. These are normalized to give the probability of
going from a current state to a new state given the next segment-label. Thus,
knowledge acquisition is done by inferring probabilities from sample data.
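The counting and normalizing step can be sketched as follows. This is one reading of the description, assuming that the counts are normalized over the possible next states for each pair of current state and next segment-label; the function name, the state names, and the label names are hypothetical and are not taken from POMOW.

```python
from collections import Counter, defaultdict

def transition_probabilities(triples):
    """Estimate P(next state | current state, next segment-label)
    from aligned (current state, next state, next label) triples."""
    counts = Counter(triples)
    totals = defaultdict(int)
    for (current, _next_state, label), n in counts.items():
        totals[(current, label)] += n
    return {
        (current, next_state, label): n / totals[(current, label)]
        for (current, next_state, label), n in counts.items()
    }

# Hypothetical aligned training data.
triples = [
    ("STOP_LEFT", "I_VOWEL", "ih"),
    ("STOP_LEFT", "I_VOWEL", "ih"),
    ("STOP_LEFT", "A_VOWEL", "ih"),
    ("STOP_LEFT", "LIQUID_LEFT", "r"),
]
probs = transition_probabilities(triples)
print(probs[("STOP_LEFT", "I_VOWEL", "ih")])  # two of three "ih" observations
```

With estimates of this kind, the probability of a path through the network is built up from the transitions along it, one per observed segment-label.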

The  sequence  of  segment  labels,  and  therefore  the  syltype  state  network,  is  not 
traced  left  to  right  (i.e.,  in  the  direction  of  increasing  time)  but  from  the  syltype  nucleus 
out  to  the  ENDILEFT  state  and  then  from  the  nucleus  to  the  ENDIRIGHT  state,  as 
indicated  by  the  arrows  in  Figure  1.4.  This  gives  a starting  point  common  to  alternative 
syltypes  (since  all  syltypes  have  a nucleus). 

Once we have syltypes supported by the observed segment-labels, we must find
the words which best fit the syltypes. The ideas behind the hypothesization of words
from the syltypes are less sophisticated: Each hypothesized syltype suggests a set of
words to hypothesize. A particular word is included in the set if it has a syllable which
maps to the syltype and the syllable was marked in the word-phoneme dictionary as
having enough stress to reliably indicate the word's presence in the utterance. (The
decision of whether to mark a syllable as stressed is ad hoc.) Multisyllabic words in this
set are rejected if they match poorly with adjacent syltype hypotheses. The match
uses a measure based on the conditional probability that a word's syltype occurs given
that a particular hypothesized syltype is observed. These probabilities are derived
from a model of syltype ambiguity and the data-derived probabilities of segment-label
to phoneme class confusion. Words not rejected are passed to the word verifier
(Wizard [McKeown - 1977]) of the Hearsay-II system. Words rated well by the verifier
are hypothesized. Thus, POMOW is used as a word filter for the word verifier.

7 That is, a person using a transcription of the utterance, the acoustic waveform, and
perhaps a spectrogram makes a phonetic transcription of the utterance.

1.3.2  Results  for  POMOW 

The performance measures relevant to POMOW's task of filtering the vocabulary
for the word verifier are the number of correct word hypotheses (i.e., hypotheses
matching utterance words in name and position but not necessarily the best-rated word
hypothesis for the position) and the total number of words hypothesized per utterance
word. The following results are based on testing 48 utterances, none of which were
included in the 60 training utterances. A 1011-word vocabulary was used. These
results are for POMOW alone; the results for POMOW and Wizard as a pair will be given
in Chapter 6 in comparison with Noah's results.

Performance
    Percent of words in utterance hypothesized correctly:    65%
    Avg. number of words hypothesized per utterance word:    90

Size
    Program:                                                 20K (36 bits)
    Permanent storage:                                       13K
    Word and Syltype Knowledge for 1011-Word Vocabulary:     11K
    Total:                                                   44K

Computation Costs
    Number of million instructions per second of speech:     9
    Times real-time for a PDP-KL10:                          6.9

1.3.3 Limitations of POMOW

POMOW's inability to discriminate between words beyond that permitted by the
definition of syltypes (equivalence classes of syllables) makes its performance and
computation costs very sensitive to the size of the vocabulary. An increase in the
vocabulary from 520 words to 1011 words increases the number of unique syltypes
only from about 200 to 250. Thus, as the vocabulary increases, the average number of
words that are supported by each syltype increases almost as fast, causing a rapid
decrease in performance and speed.

Another problem affecting performance lies with the syllable equivalence classes
defined by the syltypes. A syllable is a member of one and only one syltype class;
however, in practice it is impossible to separate syllables into strict classes. The
problem stems directly from the attempt to assign the phonemes to one of seven
classes, as given in Figure 1.3. Phonemes which tend to lie "between" classes cause a
loss in discrimination between the classes for POMOW, which results in a loss in
discrimination between different syltypes. This contributes to the high figure of an
average of 90 word hypotheses (which is 9% of the vocabulary) for each utterance
word.

The  solution  for  both  problems  is  to  increase  the  precision  of  the  syltypes, 
decreasing  the  number  of  syllables  per  syltype  class,  and  in  the  limit  to  use  syllables 
rather  than  syltypes.  This  permits  better  performance  at  the  cost  of  increasing  the 
computation  for  hypothesizing  syltypes  or  syllables  at  the  syllable  level.  However,  this 
may  be  more  than  balanced  by  a decrease  in  computation  for  hypothesizing  words  from 
the  syllables.  It  is  in  this  direction  that  the  design  of  Noah  was  taken. 

1.3.4  Conclusions  for  POMOW 

The performance of POMOW by itself (65% correct word hypotheses competing
with an average of 90 incorrect hypotheses) was found to be too low for the goals of
the Hearsay-II system. For this reason, the Wizard word verifier was used to test all
syntactically legal initial and final words of the utterance. This raised the effective
bottom-up word accuracy for the system to above 75%. For less constrained grammars
and larger vocabularies, this would not be feasible. With larger vocabularies, POMOW's
performance would degrade considerably, its computation time would increase almost
linearly, and its method of acquiring necessary word pronunciation variations would
become unwieldy. In spite of these problems, POMOW served to clarify the problems
and advantages of doing bottom-up word hypothesization and, with the use of Wizard at
the ends of the utterance, it has been a key component of Hearsay-II.




1.4  Overview  of  Noah 

The Noah word hypothesizer solves POMOW's performance, knowledge
acquisition, knowledge storage, and computation problems for large vocabularies. This is
done by basing word hypotheses on complete syllables rather than syltypes, separating
a priori word knowledge (base pronunciations of words) from segmenter-labeler-specific
knowledge, and storing the knowledge in a uniform and efficient representation.

Noah uses two levels between the input level of segment-label hypotheses[8] and
the output level of word hypotheses, where POMOW uses one. These are 1) the sylpart
level, consisting of parts of syllables -- onsets (the initial non-nucleus part of syllables),
vowels, and codas (the final non-nucleus part of syllables) -- and 2) the syllable level,
consisting of complete syllables (not syltypes as found in POMOW). Knowledge is stored
in a hierarchy-tree representation. That is, between each pair of adjacent levels
(segment-sylpart, sylpart-syllable, and syllable-word level pairs) is a tree structure
storing a sequence of lower level units to define a higher level unit. The last node of
the sequence of lower level units points to the defined higher level unit. For example,
the syllable-word tree stores sequences of syllables defining each word in the
vocabulary. The tree between each pair of levels permits merging common initial parts
of sequences to reduce storage costs and recognition time.[9] Thus, the words "confide"
and "confuse" share the first syllable node, "con", in the tree, which then points to
subnodes "fide", "fuse", etc. (Section 3.2 gives an example of such a tree.)
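The sharing of initial syllables can be sketched with nested dictionaries. This is only an illustration of the idea, not Noah's storage format: the marker key `WORD`, the function names, and the informal syllable spellings taken from the example above are assumptions.

```python
WORD = "#word"  # assumed marker key for the word defined by a path

def build_syllable_word_tree(vocabulary):
    """Merge the syllable sequences of all words into one tree."""
    root = {}
    for word, syllables in vocabulary.items():
        node = root
        for syllable in syllables:
            node = node.setdefault(syllable, {})  # reuse a shared initial part
        node[WORD] = word                         # last node points to the word
    return root

def word_for(tree, syllables):
    """Follow a syllable sequence; return the word it defines, if any."""
    node = tree
    for syllable in syllables:
        node = node.get(syllable)
        if node is None:
            return None
    return node.get(WORD)

vocabulary = {"confide": ["con", "fide"], "confuse": ["con", "fuse"]}
tree = build_syllable_word_tree(vocabulary)
print(len(tree), sorted(tree["con"]))    # 1 ['fide', 'fuse']
print(word_for(tree, ["con", "fuse"]))   # confuse
```

The single "con" node is stored once and matched once for both words, which is the storage and recognition saving the text describes.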

The knowledge stored in the two higher level trees, the sylpart-syllable tree
and the syllable-word tree, is obtained by processing a word-phoneme dictionary
similar to the one used by POMOW. However, in this dictionary only a base
pronunciation is given for each word. During processing of the dictionary some of the
words also receive alternate pronunciations based on schwa deletion rules. (Schwas are
reduced vowels that are sometimes deleted in normal speech. An example is the second
syllable, "AX", in "summary" (S AX M - AX - R IY), which is often deleted giving
S AX M - R IY.)


8 Segment-labels were described in Section 1.2.2 in reference to POMOW's input. A
more complete description of the segmenter-labeler used by Noah (and POMOW)
appears in Chapter 2. However, it is important to note that the design of Noah
permits it to use speech that has been labeled but not segmented (i.e., uniformly
divided into 10ms samples, for example), although at an increase of storage and
computation costs. (However, this mode has not been experimented with.) This will be
discussed in Section 5.2.3.1.

9 The  Lexical  Retrieval  Component  of  the  HWIM  system  also  uses  a tree  between  the 
phone  level  and  the  word  level,  for  storing  word  pronunciations.  Noah  borrows  this 
idea  and  expands  it  to  a hierarchy  of  trees. 



The knowledge stored in the segment-sylpart trees[10] is acquired by training the
system on speech that has been hand-segmented into onset, vowel, and coda segment
patterns. By aligning this data with the segment-labels produced for the speech by the
same segmenter-labeler which will be used during recognition, segment patterns are
learned for the sylparts.

Several methods are used to handle the coarticulation problems common at these
levels. One method is the ability to learn a vowel-sequence, which may occur when
more than one syllable (as hand-segmented) share the same syllable nucleus (as
detected by the segmenter-labeler). A second is context-learning, which involves
learning the surrounding segment-context of a segment-pattern in order to account for
variations in the segment-patterns learned for a sylpart.

Word hypothesization in Noah is a bottom-up recognition process through four
levels: 1) syllable nuclei are recognized at the segment level; 2) vowels, onsets, and
codas are hypothesized and rated at the sylpart level, based on the segment labels; 3)
syllables are hypothesized and rated at the syllable level, based on the sylpart
hypotheses; and 4) words are hypothesized at the word level, based on the syllable
hypotheses. The recognition algorithm between each pair of levels is very similar: A
search is made through each tree based on the lower level hypotheses. Whenever such
a path defines a higher level unit, that unit is hypothesized. Ratings for the lower level
hypotheses direct the search and determine the rating of the higher level hypothesis.

1.5  Organization  of  Chapters 

The chapters follow the above overview of Noah. Chapter 2 describes the
hierarchy-tree representation by a simple example and as a member of a series of
representation structures. Various measures of the effectiveness of this representation
in reducing storage and recognition costs and for the analysis of the recognition
algorithm are also given in the chapter. Since one of the measures depends on the
performance of the segmenter-labeler used by Noah, a description of the segmenter-labeler
is also presented.

Chapter  3 describes  the  hierarchy-tree  representation  applied  to  Noah,  showing 
also  the  storage  of  auxiliary  information.  Other  representations  used  in  speech 
recognition  systems  are  also  discussed. 

The acquisition of knowledge is the topic of Chapter 4. The dictionary
knowledge stored in the sylpart-syllable tree and the syllable-word tree and the
segment-pattern knowledge stored in the segment-sylpart tree are presented.

10 Three trees are used between the segment and sylpart levels, corresponding to the
three part division of syllables into onsets, vowels, and codas.

Chapter  5 gives  the  major  steps  in  the  recognition  algorithm  and  explains  how 
the  ratings  are  computed  for  the  hypotheses  at  each  level. 

The performance and runtime characteristics of Noah are given in Chapter 6.
This performance is analyzed to try to understand how it is affected by the vocabulary
size, the amount of training, and various characteristics of the vocabulary words. This
performance is compared to the performances of POMOW-Wizard and the HWIM Lexical
Retrieval Component.

The final chapter includes a summary, some conclusions, and a look at future
work for bottom-up word hypothesization.

1.6  Hints  to  the  Reader 

For  the  reader  who  wants  to  take  this  thesis  in  measured  doses,  the  following 
is  suggested. 

> Small dose: Chapter 1 (omitting Section 1.3 on POMOW), Section 6.1, for the discussion
of performance measures and test conditions, and Chapter 7.

> Medium  dose:  Chapter  1,  first  two  sections  of  Chapter  2,  first  four  sections  of  Chapter 
3,  Chapter  4 through  Section  4.3.2  and  Section  4.3.4  on  context  learning,  Chapter  5 
through  Section  5.2.2,  Chapter  6,  and  Chapter  7. 

> Maximum  dose:  All  chapters,  but  perhaps  taking  them  first  in  the  doses  prescribed 
above. 


2.1  Introduction 


The intent of this chapter is to present, by simple example, the structure of the
knowledge representation used in Noah and to describe tools for analyzing the
effectiveness of the knowledge representation. In order to evaluate the representation
we will want to be able to measure the efficiency of knowledge storage, the efficiency
of knowledge retrieval, and the propagation of error occurring during knowledge
retrieval.

2.2  Knowledge  Representation  Structures 

The purpose of the word hypothesizer is to convert an input of segment labels
into an output of word hypotheses. The basic knowledge needed to do this is the
sequence of sounds which make up each word. In particular, the hypothesizer needs to
store a sequence of symbols for each word which can be compared to an input
sequence of segment labels. A word is recognized (correctly or incorrectly) by a close
matching[1] of the stored sequences of symbols for the word with the input sequence of
segment labels.

An  obvious  representation  for  storing  sequences  is  illustrated  by  the  "toy" 
example  of  Figure  2.1.  The  defining  sequence  (in  this  case  a string  of  lower  case 
letters)  is  stored  explicitly  for  each  word. 


1 If the symbols stored for a word represent the ideal sounds in the word (e.g.,
phonemes) then the matching of stored symbols to segment labels will use a distance
metric between each symbol and segment label. However, if the symbols stored for
each word are actually segment labels, then the match is simply a comparison of
stored labels and input labels. (Input segment labels may include a rating indicating
how likely it is that the label really occurred in the speech.)




W1 : abcde
W2 : dedfg
W3 : abcdfg
W4 : deabc

Storage cost: 25 symbols
Recognition cost: 21 matches per input symbol


Figure 2.1: Simple Storage Structure




The cost of recognition is based on a top-down recognition algorithm which 1)
matches all stored sequences entirely before the closest matching word is chosen[2] and
2) does not use word boundary information - in particular the ends of the input
sequence - to constrain word matching; every stored sequence is matched at every
position in the input sequence. With a very large vocabulary, two necessary goals of
the knowledge representation are efficient storage of the sequence for each word and
efficient matching of a new input sequence with all stored sequences. This
representation does not meet either goal.
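The two costs quoted for the simple structure can be reproduced with a few lines. The sketch assumes that the storage cost counts one cell per stored sequence symbol plus one cell per word name; that accounting is our reading of the figure, not something stated in the text.

```python
# Toy vocabulary of the simple structure: each word stores its sequence explicitly.
WORDS = {"W1": "abcde", "W2": "dedfg", "W3": "abcdfg", "W4": "deabc"}

# Number of stored sequence symbols.
stored_symbols = sum(len(sequence) for sequence in WORDS.values())

# Assumption: the storage cost also counts one cell for each word name.
storage_cost = stored_symbols + len(WORDS)

# Every stored sequence is matched in full at every input position, so each
# input position costs one comparison per stored symbol.
matches_per_input_symbol = stored_symbols

print(storage_cost, matches_per_input_symbol)  # 25 21
```

Both costs grow in direct proportion to the total length of the stored sequences, which is why this representation fails for a very large vocabulary.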


One method of reducing storage is to find common subsequences in the words
and then to replace each subsequence by a new symbol as is done in text file
compression [Rubin-1976]. If L is the length of the subsequence and N is the number of
times it appears, then the storage is reduced by N*L - (L + N) cells (the N occurrences of
length L are replaced by N single symbols plus one stored definition of length L). When all
subsequences are replaced by a new symbol, a three level hierarchy structure is
formed to represent the knowledge. Figure 2.2 shows the effect of this on the example.


W1 : S1 S2        S1 : abc
W2 : S2 S3        S2 : de
W3 : S1 S3        S3 : dfg
W4 : S2 S1

Storage cost: 23 symbols
Recognition cost: 11 matches per input symbol


Figure 2.2: Hierarchy Structure[3]
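The storage figure for the hierarchy structure can be checked with a short calculation. The sketch reads the savings formula as N*L - (L + N) and assumes, as an interpretation of the figure, that the storage cost counts one extra cell for each word name and each subsequence name.

```python
SUBSEQS = {"S1": "abc", "S2": "de", "S3": "dfg"}
WORDS = {"W1": ["S1", "S2"], "W2": ["S2", "S3"],
         "W3": ["S1", "S3"], "W4": ["S2", "S1"]}

def saving(name):
    """Cells saved by replacing one subsequence: N*L - (L + N)."""
    length = len(SUBSEQS[name])                             # L
    uses = sum(seq.count(name) for seq in WORDS.values())   # N
    return uses * length - (length + uses)

savings = {name: saving(name) for name in SUBSEQS}

# Symbols stored at the word level plus symbols stored in the definitions.
symbols = (sum(len(seq) for seq in WORDS.values())
           + sum(len(definition) for definition in SUBSEQS.values()))

# Assumption: one further cell per word name and per subsequence name.
storage_cost = symbols + len(WORDS) + len(SUBSEQS)

print(savings, symbols, storage_cost)  # {'S1': 3, 'S2': 1, 'S3': 1} 16 23
```

The total saving of 5 cells takes the 21 sequence symbols of the simple structure down to 16, and the 3 added subsequence names account for the net change from 25 to 23.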

Though this structure reduces storage and recognition costs, it does so by using
two assumptions which are possibly weak. The first assumption is that the realization


2 In practice heuristics can be developed to abort the matching of a sequence based on
the partial score of its initial match. Since such heuristics work as well (or perhaps
better) with the other structures presented here, we will not include their use in the
cost estimate. The example itself was chosen to illustrate the relative advantages of
each structure when used to store knowledge for word hypothesization. Other
examples can be found to favor a particular representation.

3 The  recognition  cost  is  explained  in  the  next  section. 


2 Structures  and  Measures 




of  a subsequence  is  independent  of  the  surrounding  context.  This  is  true  in  text 
compression  applications  where  there  is  no  uncertainty  concerning  the  input.  In  speech 
recognition,  however,  we  must  be  sure  that  the  subsequences  are  context  independent, 
or  we  must  at  least  adjust  for  any  dependency.  The  second  assumption  (which  accounts 
for  the  reduction  in  recognition  cost)  is  that  the  input  sequence  can  be  separated  into 
regions such that each contains one subsequence (one S_j in the example). When this
assumption fails, the hierarchy structure increases rather than decreases the
recognition cost.


Storage cost: 19 symbols + 19 pointers
Recognition  cost:  15  matches  per  input  symbol 

Figure  2.3:  Tree  Structure 

Another method of reducing recognition and possibly storage costs is to let
sequences share common initial sequences by merging all sequences into a tree
structure as shown in Figure 2.3. The storage of symbols has decreased, but total
storage may increase with the addition of pointers. Note that the terminal node of
each path points to a "leaf" cell which contains the word defined by the path. (A
nonterminal node can also point to a leaf; consider storing "W5" defined by "ded".)
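The merging of common initial sequences can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the four example words spell out as abcde, dedfg, abcdfg, and deabc (an inference from the example), and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)  # symbol -> Node
    leaves: list = field(default_factory=list)    # words defined by the path ending here

def build_tree(words):
    """Merge all sequences into one tree so that common initial sequences are shared."""
    root = Node()
    for name, sequence in words.items():
        node = root
        for symbol in sequence:
            node = node.children.setdefault(symbol, Node())
        node.leaves.append(name)
    return root

def count_nodes(node):
    """Number of symbol nodes below `node` (the root itself stores no symbol)."""
    return sum(1 + count_nodes(child) for child in node.children.values())

def count_leaves(node):
    """Number of leaf cells reachable from `node`."""
    return len(node.leaves) + sum(count_leaves(c) for c in node.children.values())

# Assumed spellings of the example words.
WORDS = {"W1": "abcde", "W2": "dedfg", "W3": "abcdfg", "W4": "deabc"}

tree = build_tree(WORDS)
simple_matches = sum(len(s) for s in WORDS.values())  # every stored symbol is matched
tree_matches = count_nodes(tree)                       # one match per tree node
stored_symbols = tree_matches + count_leaves(tree)     # node symbols plus leaf cells

# A nonterminal node can also carry a leaf: "ded" ends on an existing interior node.
tree_with_w5 = build_tree({**WORDS, "W5": "ded"})
```

Under these assumptions the node and symbol counts agree with the costs quoted for the tree structure, and adding "W5" creates a new leaf without creating any new node.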

It  is  possible  to  combine  the  recognition  savings  of  the  hierarchy  structure  and 
the  tree  structure  by  forming  a tree  between  each  level  of  the  hierarchy  and  creating 
what  we  will  call  the  hierarchy-tree  structure.  We  show  an  example  of  this  structure  in 
Figure  2.4.  We  can  explain  the  decrease  in  recognition  costs  of  this  structure  over  the 
previous  two  structures  in  two  ways.  From  one  viewpoint,  the  hierarchy-tree  structure 
reduces  the  recognition  cost  of  the  hierarchy  structure  by  storing  the  sequences  at 
each  level  in  the  more  efficient  tree  structure.  From  another  viewpoint,  we  can  say  that 


4 See [Knuth-1973] Vol. 3, pp. 481 and following for a description of the "Trie" structure,
which is also used to store this type of information.




[Figure: hierarchy-tree diagram; the leaves, top to bottom, are W1, W3, W2, W4]

Storage cost: 20 symbols + 20 pointers
Recognition cost: 9 matches per input symbol

Figure 2.4: Hierarchy-Tree Structure

the hierarchy-tree structure reduces the recognition cost of the tree structure by using
more but shorter sequences to store the information. The savings of the tree structure
occur at the initial part of each sequence, so that shorter sequences permit greater
savings. The above reasoning would also hold for storage costs if we did not include
the cost of storing pointers. Chapter 3 describes how Noah uses a hierarchy-tree
structure (with four levels) to represent its knowledge.
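The costs of the hierarchy-tree example can be reproduced with a small sketch. It assumes the example defines W1 = S1 S2, W2 = S2 S3, W3 = S1 S3, W4 = S2 S1 with S1 = abc, S2 = de, S3 = dfg (a reading of the example figure), and it divides the upper tree's matches by the average subsequence length to express them per input symbol, the adjustment described in the next section. Function names are illustrative.

```python
def count_tree_nodes(sequences):
    """Nodes in a tree that merges common initial subsequences.

    Each distinct non-empty prefix corresponds to exactly one tree node.
    """
    prefixes = set()
    for seq in sequences:
        for end in range(1, len(seq) + 1):
            prefixes.add(tuple(seq[:end]))
    return len(prefixes)

# Assumed definitions of the example at the two levels of the hierarchy.
SUBSEQUENCES = {"S1": "abc", "S2": "de", "S3": "dfg"}
WORDS = {
    "W1": ["S1", "S2"],
    "W2": ["S2", "S3"],
    "W3": ["S1", "S3"],
    "W4": ["S2", "S1"],
}

lower_nodes = count_tree_nodes(SUBSEQUENCES.values())  # tree over input symbols
upper_nodes = count_tree_nodes(WORDS.values())         # tree over subsequence symbols

# Average number of input symbols per subsequence symbol.
average_length = sum(len(s) for s in SUBSEQUENCES.values()) / len(SUBSEQUENCES)

# Matches per input symbol: lower tree, plus upper tree scaled to the input rate.
recognition_cost = lower_nodes + upper_nodes / average_length

# Stored symbols: one per tree node plus one leaf cell per defined unit.
stored_symbols = lower_nodes + len(SUBSEQUENCES) + upper_nodes + len(WORDS)
```

With these assumptions the estimate comes out slightly above 9 matches per input symbol, which rounds to the figure quoted for the hierarchy-tree structure.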


2.3  Measures of Storage and Recognition Costs

To  speak  precisely  about  different  representations,  we  need  to  measure  their 
storage  and  recognition  costs.  In  particular,  we  will  be  concerned  with  the  performance 
of  the  hierarchy-tree  representation  compared  to  the  simple  storage  structure.  This 
comparison  is  done  in  two  steps  for  both  storage  and  recognition  costs.  In  the  first 
step,  the  storage  and  recognition  costs  for  each  level  of  the  hierarchy  are  compared  to 
the  storage  and  recognition  costs  incurred  if  the  particular  level  had  been  omitted  (i.e., 
if  the  simple  storage  representation  had  been  used).  In  the  second  step,  the  storage 
and  recognition  costs  of  each  tree  between  two  adjacent  levels  are  compared  to  the 
storage  and  recognition  costs  incurred  if  the  tree  had  been  omitted  (i.e.,  if  the  simple 
storage  representation  had  been  used).  In  each  case,  ratios  of  the  costs  show  the 
comparison.  These  measures  will  be  applied  to  the  structures  in  Noah  in  Section  3.5. 

Consider a hierarchy-tree representation with D levels numbered from D (the
highest level) to 1 (the lowest level). Let U_i be the number of unique symbols at level i,
N_{i,j} be the total number of symbols at level i needed to define the symbols at level j
(i<j, i.e., the sum of the lengths of the sequences at level i), L_{i,j} be the average length
of the sequences of level i defining the symbols of level j, and T_{i,j} be the number of
nodes used by the tree between levels i and j (i = j-1) to store the sequences of symbols of
level i (needed to describe the symbols at level j). The ratio of storage costs when
using level i in a hierarchy to the storage costs found when level i is omitted is:

Eq. 2.1  Hierarchy Storage:  HS_i = \frac{N_{i-1,i} \log_2 U_{i-1} + N_{i,i+1} \log_2 U_i}{N_{i-1,i+1} \log_2 U_{i-1}}

The ratio of recognition costs for level i of a hierarchy to the recognition costs
found when level i is omitted is:

Eq. 2.2  Hierarchy Recognition:  HR_i = \frac{N_{i-1,i} + N_{i,i+1} / L_{i-1,i}}{N_{i-1,i+1}}


The storage cost ratio is the cost of storing the sequences of symbols of level i-1
defining the symbols of level i (expressed in the number of bits of storage), plus the
cost of storing the sequences of symbols of level i defining the symbols of level i+1,
divided by the cost of storing the sequences of symbols of level i-1 to define the
symbols of level i+1, which would be necessary if level i was omitted.


The recognition cost ratio is more of an estimate. The number of matches per
input symbol of level i-1 required to recognize the symbols of level i+1 is
computed with level i (in the numerator) and without level i (in the denominator) to give
the ratio. When using level i, two levels of matches are made. First, all stored
symbols of level i-1 must be matched with each input symbol to recognize symbols of
level i. Then the stored symbols of level i must be matched with these newly
recognized symbols of level i to recognize symbols of level i+1. However, this second
cost must be adjusted to get the cost per input symbol of level i-1. This is done by
dividing the second cost by the average number of level i-1 symbols for each level i
symbol (i.e., the average length of the sequences at level i-1 defining the symbols of
level i). For example, the recognition cost for Figure 2.2 is 8 + (8 / 2.67) = 11 matches
per input symbol.
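The two hierarchy ratios described above can be written out directly. This is a sketch under stated assumptions: the parameter names are illustrative, and the example values assume that the small hierarchy example has 8 symbols at each of its two stored levels, 7 unique input symbols, 3 unique subsequence symbols, and 21 symbols when the words are stored without the middle level.

```python
import math

def hierarchy_storage_ratio(n_low_mid, n_mid_high, n_low_high, u_low, u_mid):
    """Bits stored with the middle level divided by bits stored without it."""
    with_level = n_low_mid * math.log2(u_low) + n_mid_high * math.log2(u_mid)
    without_level = n_low_high * math.log2(u_low)
    return with_level / without_level

def hierarchy_recognition_cost(n_low_mid, n_mid_high, avg_len_low_mid):
    """Matches per lowest-level input symbol when the middle level is used."""
    return n_low_mid + n_mid_high / avg_len_low_mid

def hierarchy_recognition_ratio(n_low_mid, n_mid_high, n_low_high, avg_len_low_mid):
    """Matches per input symbol with the middle level divided by matches without it."""
    return hierarchy_recognition_cost(n_low_mid, n_mid_high, avg_len_low_mid) / n_low_high

# Assumed values for the small hierarchy example.
N_LOW_MID = 8            # input symbols defining the subsequence symbols
N_MID_HIGH = 8           # subsequence symbols defining the words
N_LOW_HIGH = 21          # input symbols defining the words directly
AVG_LEN = N_LOW_MID / 3  # three subsequences, average length about 2.67

cost = hierarchy_recognition_cost(N_LOW_MID, N_MID_HIGH, AVG_LEN)
recognition_ratio = hierarchy_recognition_ratio(N_LOW_MID, N_MID_HIGH, N_LOW_HIGH, AVG_LEN)
storage_ratio = hierarchy_storage_ratio(N_LOW_MID, N_MID_HIGH, N_LOW_HIGH, 7, 3)
```

A ratio below 1 means the extra level pays for itself; under these assumed values both ratios fall below 1 for the example.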


Given two adjacent levels of the hierarchy, i and i+1, the ratio of the storage cost found
when using a tree to store the sequences of level i to the storage cost when using
the simple storage structure is:

Eq. 2.3  Tree Storage:  TS_i = \frac{T_{i,i+1} (\log_2 U_i + \log_2 T_{i,i+1})}{N_{i,i+1} \log_2 U_i}


5 To be precise, the least integer greater than log_2 U_i should be used instead of
log_2 U_i, but since these equations will be used to compare the representation for
different data bases, we will use the continuous function.

6 This falsely assumes that the sequences of level i-1 defining the symbols of level i
are equally likely to occur in the input.




The ratio of recognition costs is:

Eq. 2.4  Tree Recognition:  TR_i = \frac{T_{i,i+1}}{N_{i,i+1}}

The storage cost for a tree includes the cost (\log_2 T_{i,i+1}) of storing one pointer
for each symbol.


2.4  The  Confusion  of  Hypotheses 

The recognition algorithm for the hierarchy-tree structure uses the tree at each
level to convert an input sequence of units at one level into a new sequence of units at
the next higher level. The recognition is complicated, however, by the nature of the
lowest level input obtained from the segmenter-labeler. Since the segmenter-labeler is
using only part of the information necessary for speech recognition (i.e., a localized part
of the acoustics, but not the full utterance, nor syntax, semantics, prosodics, etc.), its
segmentation and labeling is uncertain. It could communicate this labeling uncertainty
by putting out the best label choice for each segment and letting the rest of the speech
system use a precomputed label-to-label distance metric, or it could output a list of
labels (as competing hypotheses) for each segment together with a corresponding list of
ratings. The second method loses less information than the first and is used by the
Hearsay-II segmenter-labeler [Goldberg, Reddy, and Gill - 1977]. This uncertainty in
the lowest level input propagates up to higher levels of the speech system until other
knowledge can constrain the possible hypotheses to, hopefully, the correct sequence of
words and the correct semantic interpretation. In order for the recognition algorithm of
Noah to handle this uncertainty, it must be able to search many paths simultaneously in
each tree and store the best sequences at each level. We discuss the method of doing
this in Chapter 5.

One goal of this thesis was to develop means of analyzing how the uncertainty
of one level is reduced (or maybe increased) as recognition proceeds to the next higher
level. One can view the tree between levels i and i+1 as a syntax filter on the
uncertainty of the data at level i to obtain less uncertain data at level i+1. We need a
way of measuring the uncertainty of the information in the list of symbols at each
position for each level, in order to see how each step in the recognition algorithm
contributes to the goal of reducing the uncertainty of the acoustic information to the
correct sequence of words for the utterance.

In  the  remainder  of  this  section,  we  first  give  a brief  description  of  the 
segmenter-labeler  in  order  to  understand  the  nature  of  the  input  for  Noah.  We  then 




describe  a measure  for  the  confusion  of  hypotheses,  which  will  be  applied  in  Section  5.4 
to  the  segment  labels  and  the  hypotheses  generated  by  Noah  in  order  to  analyze  the 
recognition  algorithm. 

2.4.1  Segmenter-Labeler  Hypotheses 

The  following  description  of  the  segmenter-labeler  used  by  Noah  is  taken  from 
[Erman  - 1977]. 

Four  parameters  are  derived  by  simple  algorithms  operating 
directly  on  the  digitized  audio  signal  (9  bit  sampled  at  10  KHz.)  and  are 
used  by  the  segmenter  as  the  basis  for  an  acoustic  segmentation  and 
classification  of  the  utterance.  This  segmentation  is  accomplished  by  an 
iterative  refinement  technique:  First  silence  is  separated  from  non- 
silence; then,  the  non-silence  is  broken  down  into  the  sonorant  and  non- 
sonorant  regions,  etc.  Eventually,  five  classes  of  segments  are  produced: 
silence,  sonorant  peak,  sonorant  non-peak,  fricative,  and  flap.  Associated 
with  each  classified  segment  is  its  duration,  absolute  amplitude,  and 
amplitude  relative  to  its  neighboring  segments.  The  segments  are 
contiguous  and  non-overlapping,  with  one  class  designation  for  each.  (A 
slightly  finer  classification  of  these  segments  produces  the  segment  class 
labels  used  by  Noah  for  identification  of  syllable  nuclei  — discussed  in 
Section 4.3.3.1.)

The  labeler  does  a finer  labeling  of  each  segment.  The  labels  are 
allophonic-like;  there  are  currently  98  of  them  (see  Appendix  X).  Each  of 
the 98 labels is defined by a vector of autocorrelation coefficients
[Itakura  - 1975].  These  templates  are  generated  from  speaker- 
dependent  training  data  that  have  been  hand-labeled.  The  result  of  the 
labeling  process,  which  matches  the  central  portion  of  each  segment 
against  each  of  the  templates  using  the  Itakura  metric,  is  a vector  of  98 
numbers;  the  i’th  number  is  an  estimate  of  the  (negative  log)  probability 
that  the  segment  represents  an  occurrence  of  the  i’th  allophone  in  the 
label  set. 

Evaluating  the  performance  of  the  segmenter-labeler  is  difficult.  This  is  due,  in  part,  to 
the difficulty of setting a standard for comparison. One method that has been used is
to  compare  the  segment  label  output  to  a hand-made  segment  label  description  of  the 
words  in  an  utterance.  This  is  done  automatically  by  using  the  Harpy  speech  system 
(which  uses  the  same  segmenter-labeler)  in  a "forced  recognition"  mode  as  follows:  the 
Harpy  speech  system  recognizes  an  utterance  by  finding  the  path  through  a network  of 
labels  that  matches  best  with  the  output  of  the  segmenter-labeler  (i.e.,  the  combined 
rating  of  all  the  labels  defined  by  the  path  is  best).  The  network  combines  a description 
of  all  possible  sequences  of  words  which  the  grammar  permits  and  all  possible 
sequences  of  labels  for  each  word  which  the  word-allophone  dictionary  permits.  By 
finding  the  best  path  through  the  sequences  of  labels  for  the  correct  words  of  an 



utterance,  a "correct"  sequence  of  labels  is  defined.  This  "correct"  sequence  can  be 
used to measure the distribution of the "correct" labels from the segmenter-labeler by
rank  or  by  rating.  Thus,  the  standard  for  correct  labels  is  derived  from  a hand-made 
word-allophone  dictionary  in  which  several  alternate  labels  are  given  for  each 
allophone-position  in  the  word.  ([Lowerre  - 1976]  gives  this  dictionary  in  an  appendix.) 

The following performance, based on over 26,000 segments of speech, is
observed [Goldberg - 1977]:

    Rank of label:           1     2     3     4     5     6     7     8
    Cumulative accuracy:    42%   58%   65%   71%   75%   77%   80%   81%

The distribution of accuracy for the rating (a value between 0 and 127) of the
correct label minus the rating of the best label (which will have the lowest value of a
set of labels) in groups of 10:

    Rate(correct) - Rate(best):   0-9   10-19   20-29   30-39   40-49   50-127
    Accuracy in group:            57%    12%      9%      6%      4%     12%

Thus, for example, 6% of the time the correct label is rated worse than the
best-rated label by 30 to 39 points.

2.4.2  A Measure  of  the  Confusion  of  Hypotheses 

How can we measure the uncertainty of the 98 labels for a segment and for
other competing hypotheses which are generated by Noah? One measure which
immediately comes to mind is Shannon's entropy measure [Shannon - 1948]. The
entropy measure is restricted to mutually exclusive events, but the segment labels are
not viewed as mutually exclusive events by the segmenter-labeler. Rather, it attempts
to measure the likelihood of each label occurring in a segment of speech and, in general,
this is done independently of the likelihood that another label occurred in the same
segment. In effect, the segmenter-labeler assigns pseudo probabilities as the likelihood
measure for a label. A pseudo probability is a likelihood measure which has meaning on
a relative scale but not on an absolute scale. For example, if hypothesis h_1 has a
pseudo probability of 1.0 and hypothesis h_2 has a pseudo probability of 0.5, one can say
that h_1 is twice as likely to be correct as h_2. However, in general a probability of
1 does not mean that the hypothesis is "certainly" correct; nor need the probabilities of
competing hypotheses sum to unity.

If we consider the pseudo probabilities (p_1, p_2, ..., p_k) assigned to a set of
competing hypotheses (h_1, h_2, ..., h_k) by a knowledge source (e.g., the segmenter-labeler)
to be an accurate estimate of reality for a set of input conditions, then:

    F(h_i) = \frac{\sum_{j=1}^{k} p_j}{p_i}

gives the average number of times the same conditions must hold in a series of tests in
order for hypothesis h_i to be correct once. We define the number F(h_i) to be a
measure of the competition in a set of hypotheses for hypothesis h_i. For example, if
three competing hypotheses with pseudo probabilities p_1 = .9, p_2 = .9, and p_3 = .45 are
hypothesized by a knowledge source under a set of conditions (and the probabilities are
an accurate estimate of reality), then the same conditions must occur 2.5 times
((.9+.9+.45)/.9) on the average for every time h_1 (or h_2) is correct and 5 times
((.9+.9+.45)/.45) on the average for every time h_3 is correct. Thus, the competition for
h_1 (and for h_2) is 2.5; the competition for h_3 is 5. For a set of k competing hypotheses
having equal pseudo probabilities, the competition for each hypothesis is k. Therefore
the competition measure of an hypothesis gives the equivalent number of equally
probable competing hypotheses.
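The competition measure is simple enough to state as a few lines of code. The sketch below follows the definition in the text (the sum of the pseudo probabilities divided by the pseudo probability of the hypothesis in question); the function name is an illustrative choice.

```python
import math

def competition(pseudo_probabilities, index):
    """Competition in a set of hypotheses for the hypothesis at `index`.

    Equals the sum of all pseudo probabilities divided by the pseudo
    probability of the chosen hypothesis.
    """
    return sum(pseudo_probabilities) / pseudo_probabilities[index]

# The three-hypothesis example from the text.
example = [0.9, 0.9, 0.45]
competition_h1 = competition(example, 0)
competition_h3 = competition(example, 2)

# k hypotheses with equal pseudo probabilities have competition k each.
equal_case = competition([0.3] * 4, 0)
```

Because only ratios of pseudo probabilities enter the measure, rescaling every value by the same constant leaves the result unchanged.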


We define the confusion for a set of hypotheses to be the competition in the set
for the hypothesis with the greatest pseudo probability. Thus, the confusion measure
for a set of hypotheses equals the competition for the best hypothesis of the set:

    G = \frac{\sum_{j=1}^{k} p_j}{\max_i p_i}

The confusion measure is simply the reciprocal of the normalized probability of
the best hypothesis in the set (where the probabilities are normalized to sum to 1). The
measure is therefore unchanged by a multiplication of the probabilities of the set by a
constant.


In Noah a new hypothesis at one level is formed by concatenating two or more
adjacent hypotheses and giving the new hypothesis a probability equal to the product
of the probabilities of the old hypotheses. We now show that the confusion of a set S
of hypotheses (computed by G(S)), formed from all possible pairs of hypotheses from
two sets of adjacent hypotheses, T and U, equals G(T) times G(U). Let the pseudo
probabilities for the two sets be (p_1, p_2, ..., p_k) and (q_1, q_2, ..., q_m); the probabilities for the
new set S will be

    Prob(S) = (p_1 q_1, ..., p_1 q_m, p_2 q_1, ..., p_2 q_m, ..., p_k q_1, ..., p_k q_m).

The confusion is:

    G(S) = \frac{\sum_{i=1}^{k} \sum_{j=1}^{m} p_i q_j}{\max_{i,j} p_i q_j}

7 Since  a knowledge  source  may  have  only  partial  information  on  which  to  base  its 
decisions  (as  with  the  segmenter-labeler),  it  is  reasonable  to  assume  that  different 
hypotheses  will  be  correct  at  different  times  for  the  same  partial  information. 




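Simplifying the sum, and noting that for non-negative values the maximum of the
products is the product of the maxima, gives:

    G(S) = \frac{\left(\sum_{i} p_i\right) \left(\sum_{j} q_j\right)}{\max_i p_i \cdot \max_j q_j} = G(T) \times G(U)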
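The product property can be checked numerically. The following is a small sketch with made-up pseudo probabilities; the function names are illustrative and the sets T and U are arbitrary test values, not data from the thesis.

```python
import math
from itertools import product

def confusion(pseudo_probabilities):
    """Competition for the best hypothesis: sum of the values over their maximum."""
    return sum(pseudo_probabilities) / max(pseudo_probabilities)

def concatenate(first_set, second_set):
    """Pseudo probabilities of all pairs formed from two adjacent hypothesis sets."""
    return [p * q for p, q in product(first_set, second_set)]

# Arbitrary example sets of adjacent hypotheses.
T = [0.9, 0.9, 0.45]
U = [0.8, 0.2]

S = concatenate(T, U)
confusion_of_pairs = confusion(S)
product_of_confusions = confusion(T) * confusion(U)
```

The same check holds for sets of any size, provided all pseudo probabilities are positive.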


The segment-label ratings generated by the segmenter-labeler can be converted
to "accurate" pseudo probabilities by using a performance evaluation of the segmenter-
labeler. The rating of a segment label is converted by subtracting from it the rating of
the best-rated label in its segment and looking up for the result an observed accuracy
in a table similar to the final table of Section 2.4.1. This accuracy value is used as the
pseudo probability of the label. The pseudo probabilities for the segment labels and
for other hypotheses produced by Noah can then be used to estimate the confusion for
sets of competing hypotheses. In Chapter 5, we trace the confusion of hypotheses from
the input of segment labels to the output of word hypotheses for the Noah recognition
algorithm.
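The conversion from ratings to pseudo probabilities might look like the sketch below. The grouping of rating differences follows the groups used in Section 2.4.1, but the accuracy values in the table are hypothetical placeholders, since the text only says that a "similar" table is used; the label names and ratings in the example are also invented.

```python
import bisect

# Inclusive upper bound of each rating-difference group (ratings lie in 0..127).
GROUP_UPPER_BOUNDS = [9, 19, 29, 39, 49, 127]

# HYPOTHETICAL observed accuracy for each group; not values from the thesis.
GROUP_ACCURACY = [0.50, 0.20, 0.10, 0.05, 0.03, 0.01]

def pseudo_probabilities(ratings):
    """Map each label's rating to a pseudo probability via a table lookup.

    `ratings` maps label -> rating, where a lower rating is a better match.
    """
    best = min(ratings.values())
    result = {}
    for label, rating in ratings.items():
        difference = rating - best
        group = bisect.bisect_left(GROUP_UPPER_BOUNDS, difference)
        result[label] = GROUP_ACCURACY[group]
    return result

def confusion(values):
    """Sum of the pseudo probabilities divided by the largest one."""
    values = list(values)
    return sum(values) / max(values)

# Invented ratings for three labels of one segment.
segment_ratings = {"AA": 12, "AE": 20, "IY": 75}
segment_probabilities = pseudo_probabilities(segment_ratings)
segment_confusion = confusion(segment_probabilities.values())
```

A table lookup keeps the conversion independent of the rating scale itself, so the table can be re-estimated whenever the segmenter-labeler is re-evaluated.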


Chapter 3: Representation of Knowledge


3.1  Introduction 

The patterns describing the vocabulary words known to Noah are stored in a
hierarchy-tree structure with four levels. These levels from top to bottom are the
word level, the syllable level, the sylpart level (containing onsets, vowels, and codas),
and the segment level. Conceptually, one tree is used to join each adjacent pair of levels;
however, the segment level is joined to the sylpart level by three separate trees
corresponding to the three-part division of the syllables into onsets, vowels, and codas.

In this chapter, we give examples of the hierarchy-tree structure applied to the
knowledge in Noah (Section 3.2), attempt to justify the levels (Section 3.3), and show
what auxiliary information is stored in the representation (Section 3.4). Measures of
storage and recognition costs, developed in Chapter 2, are applied in Section 3.5 and
the actual storage costs are given in Section 3.6. Finally, in Section 3.7 we compare
this representation to others being used in speech recognition.

3.2  An  Example 

Consider the sample dictionary of Figure 3.1. Each word is followed by its
pronunciation given in the two-character ARPABET computer phonetic representation, with
hyphens indicating syllable boundaries. The second syllable of "ACM", for example, is
made up of the vowel "IY", the onset "S", and a null coda. Parentheses enclose a list of
options separated by a comma, with a "*" representing a null option. In this sample
dictionary, the first syllable of "about" can be dropped optionally.


1 An  onset  is  the  initial  nonnucleus  part  of  a syllable;  a coda  is  the  final  nonnucleus 
part. 

2 We  will  use  the  term  "segment"  to  mean  a labeled  segment  of  speech,  and  "segment 
label"  to  refer  to  one  of  possibly  many  labels  assigned  to  a segment. 

3 See Appendix A for the correspondence between the ARPABET representation and
the International Phonetic Alphabet.




    A              EY
    ABOUT          (AX - , *) B AW T
    ABSTRACT       AE B - S T R AE K T
    ABSTRACTION    AE B - S T R AE K - SH IX N
    ABSTRACTS      AE B - S T R AE K T S
    ACL            EY - S IY - EH L
    ACM            EY - S IY - EH M

Figure 3.1: Sample Dictionary


[Figure: syllable-word tree diagram with columns "Syllables" and "Word". The word
leaves shown are <A>, <ACM>, <ACL>, <ABSTRACTS>, <ABSTRACTION>, <ABSTRACT>,
<ABOUT>, and <ABOUT>.]

Figure 3.2: Syllable-Word Tree for Sample Dictionary




The  syllable-word  tree  (i.e.,  the  tree  joining  the  syllable  level  to  the  word  level) 
for  this  small  dictionary  is  shown  in  Figure  3.2.  In  this  figure  solid  lines  indicate 
pointers between nodes of the tree and dashed lines indicate pointers to leaves.
Associated  with  each  level  is  a lexicon  of  items  used  to  describe  speech  at  that  level. 
For  example,  the  syllable  level  has  a corresponding  lexicon  of  syllables.  One  of  the 
items  in  this  lexicon  is  the  syllable  "B  AW  T".  (The  angle  brackets  in  the  figure  indicate 
a unique  lexicon  number  for  the  enclosed  quantity;  the  syllable  itself  is  not  stored  at  a 
node,  but  rather  its  lexicon  number.)  It  is  possible  for  one  path  of  nodes  to  point  to 
more  than  one  leaf  if  words  share  the  same  pronunciation  (homonyms).  Also,  it  is 
possible  for  the  same  word  to  appear  on  separate  leaves  if  it  has  distinct 
pronunciations  (like  "About"  in  this  sample  dictionary). 
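A syllable-word tree of this kind can be sketched in a few lines. The sketch is only illustrative: the pronunciations are transcribed from the sample dictionary as read here, the optional first syllable of "ABOUT" is expanded by hand into two pronunciations, and the class and function names are hypothetical rather than Noah's own.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    children: dict = field(default_factory=dict)  # syllable lexicon number -> TreeNode
    leaves: list = field(default_factory=list)    # words whose pronunciation ends here

# One entry per distinct pronunciation (syllables listed left to right).
PRONUNCIATIONS = [
    ("A", ["EY"]),
    ("ABOUT", ["AX", "B AW T"]),
    ("ABOUT", ["B AW T"]),
    ("ABSTRACT", ["AE B", "S T R AE K T"]),
    ("ABSTRACTION", ["AE B", "S T R AE K", "SH IX N"]),
    ("ABSTRACTS", ["AE B", "S T R AE K T S"]),
    ("ACL", ["EY", "S IY", "EH L"]),
    ("ACM", ["EY", "S IY", "EH M"]),
]

def build_syllable_word_tree(pronunciations):
    """Return (root, lexicon); nodes store lexicon numbers, not the syllables."""
    lexicon = {}
    root = TreeNode()
    for word, syllables in pronunciations:
        node = root
        for syllable in syllables:
            number = lexicon.setdefault(syllable, len(lexicon))
            node = node.children.setdefault(number, TreeNode())
        node.leaves.append(word)
    return root, lexicon

def words_for(root, lexicon, syllables):
    """Follow a path of syllables and return the words on its leaves."""
    node = root
    for syllable in syllables:
        number = lexicon.get(syllable)
        if number is None or number not in node.children:
            return []
        node = node.children[number]
    return node.leaves

def count_nodes(node):
    return sum(1 + count_nodes(child) for child in node.children.values())

root, lexicon = build_syllable_word_tree(PRONUNCIATIONS)
```

Homonyms would simply add a second word to the same `leaves` list, while a word with two pronunciations, like "ABOUT" here, appears on two separate leaves.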

Figure  3.3  gives  the  corresponding  sylpart-to-syllable  tree.  The  main  thing  to 
notice  about  this  tree  is  that  the  sylparts  of  the  paths  do  not  follow  the  left-to-right 
order  found  in  a syllable.  The  vowels  are  put  first  to  simplify  recognition  (which  will  be 
discussed  in  Chapter  5).  Each  path  in  this  tree  has  three  non-terminal  nodes:  a vowel, 
an  onset,  and  a coda.  The  symbol  "*"  in  the  figure  represents  a null  onset  or  coda. 
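The vowel-first ordering of the paths can be illustrated as follows. This is a sketch under assumptions: the vowel set is only the subset of ARPABET vowels that appears in the sample dictionary, each syllable is assumed to contain exactly one vowel, and a nested dictionary stands in for the tree.

```python
# Vowel symbols appearing in the sample dictionary (illustrative subset only).
VOWELS = {"EY", "AX", "AW", "AE", "IX", "IY", "EH"}
NULL = "*"  # marks a null onset or coda

def split_syllable(syllable):
    """Split a space-separated syllable into (onset, vowel, coda)."""
    phones = syllable.split()
    nucleus = next(i for i, phone in enumerate(phones) if phone in VOWELS)
    onset = " ".join(phones[:nucleus]) or NULL
    coda = " ".join(phones[nucleus + 1:]) or NULL
    return onset, phones[nucleus], coda

def tree_path(syllable):
    """Order the sylparts as they appear on a path: vowel, then onset, then coda."""
    onset, vowel, coda = split_syllable(syllable)
    return (vowel, onset, coda)

def build_sylpart_syllable_tree(syllables):
    """Nested dictionary: vowel -> onset -> coda -> syllable."""
    tree = {}
    for syllable in syllables:
        vowel, onset, coda = tree_path(syllable)
        tree.setdefault(vowel, {}).setdefault(onset, {})[coda] = syllable
    return tree

sample_tree = build_sylpart_syllable_tree(
    ["S T R AE K T", "S T R AE K", "S T R AE K T S", "S IY", "EY"]
)
```

Putting the vowel first means that all syllables sharing a nucleus share the first node of their paths, regardless of their onsets.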

So far we have represented only the knowledge found in the dictionary; what
about the knowledge which characterizes the output of the segmenter-labeler? The
mapping between the idealized speech given by the pronunciations of the dictionary and
the actual speech represented by the labeled segments is accomplished by the
segment-sylpart trees. Figure 3.4 shows the segment-label patterns learned for some
of the codas present in the sample dictionary. A segment-label pattern is the sequence
of segment labels produced by the segmenter-labeler for a particular sylpart (particular
codas in this case). Each coda is followed by a list of alternate segment-label patterns.
Segment labels are enclosed in brackets to distinguish them from the phonemes in the
dictionary. For example, the coda "K T S" has three possible segment patterns, the first
of which is a sequence of three segment labels [-] (silence), [K], and [S]. The tree
storing these patterns is given in Figure 3.5. The tree gives a many-to-many mapping
of the segment patterns onto the codas. The segment-vowel tree and segment-onset
tree are similar; however, the segment-onset tree stores the patterns in reverse order
(i.e., right-to-left) and the segment-vowel tree stores them in an order depending on
the pattern (to be explained in Section 4.3.3.2). It is in these trees, between the segment
level and the sylpart level, that most of the ambiguity found in speech is stored for this


4 The  pointers  shown  are  conceptual.  The  trees  are  implemented  in  a binary  tree 
representation  (see  [Knuth  - 1968],  Vol.  1,  pp.  238),  i.e.,  the  sons  of  any  node  are 
elements  on  a linked  list. 

5 Section  4.3  describes  how  these  patterns  are  learned. 




system.  Each  sylpart  has  different  segment  patterns  representing  it,  and  each  segment 
pattern  may  represent  more  than  one  sylpart.  The  next  section  discusses  why  these 
levels  were  chosen  for  Noah. 

3.3  Levels  of  Speech  Representation 

The reasons for choosing a word level and a segment level are clear. Words are
needed because we are interested in word hypothesization; they are the ultimate outputs
of the hypothesizer. Segments compress the speech information to make pattern




recognition  tractable.  The  information  of  the  speech  wave  is  compressed  by 
parameterizing  the  acoustic  signal,  using  the  parameters  to  segment  it  into  parts  which 
are  internally  similar,  and  labeling  these  parts.  This  compression  does  not  prevent  using 
the  original  signal  to  resolve  any  remaining  ambiguity  during  later  recognition. 

The real question is why include the syllable and sylpart levels rather than
going directly from segment patterns to words? There are two constraints on this
design choice which need to be balanced: 1) the larger the unit of speech for which
segment patterns are learned, the greater the number of different units and the more
training samples needed for each unit to learn the variability which will occur in the
segment patterns, and 2) the smaller the unit of speech, the more the segment pattern
will vary depending on the context of other units (the coarticulation problem). Thus, too
large a unit becomes a learning and storage problem and too small a unit becomes a
recognition problem. The design goal for Noah was to choose the smallest unit for
which context-caused variability can be handled easily.

This goal was driven by two other goals: 1) the ability to handle large
vocabularies and 2) the ability to retrain the system easily if the segmenter-labeler is
modified. The large-vocabulary goal requires that not much more than a simple base
pronunciation for each word be stored at the word level. The combination of both goals
requires that a small unit of match be chosen. Approximately 5900 different syllables
occur in the 20,000-word dictionary. Using syllables as a unit of match would require
very large amounts of training data for the system. Similarly, it has been estimated that
over 700 initial half-syllables and over 900 final half-syllables occur in general English.
With large training sets and storage space, half-syllables or syllables could be handled;
however, we have chosen to split the syllable into three parts (the onset, vowel, and
coda) and attempt to handle any resulting coarticulation problems. In the 20,000-word
dictionary, 82 onsets, 18 vowels, and 128 codas were found. It is thought that
with this size of unit of match, the proper balance between recognition performance
versus storage and training sample size is found for a large-vocabulary word
hypothesizer.

A still  smaller  unit  of  speech  for  matching  with  the  segments  could  have  been 
the  phoneme.  Coarticulation  problems  make  it  difficult  for  the  phoneme  to  be 
recognized  accurately  from  the  segment  labels.  Experience  with  a "phone  synthesizer" 
knowledge  source  in  Hearsay-II  [Shockey  & Adam  - 1976]  made  this  clear.  Even  when 


6 See [Sivertson - 1961] for a summary of estimates on the numbers of different sized
units in speech.

7 Half-syllables,  syllables  or  maybe  even  words  may  have  to  be  used  to  obtain  the 
detail  necessary  for  word  verification  systems. 




phonemes  are  recognized  bottom-up  in  speech,  several  word  pronunciations  must  be 
stored  for  each  word  in  order  to  account  for  coarticulation  effects.  This  is  done  in  the 
HWIM  system  by  storing  about  six  pronunciations  per  word. 

There  are  two  methods  which  Noah  uses  to  handle  coarticulation  problems 
inherent  in  using  onsets,  vowels,  and  codas.  The  first  is  called  "context  learning"  and  is 
used  with  all  three  sylparts;  the  second  is  called  "vowel  sequence  learning"  and  is  used 
as  needed  for  vowels.  The  first  method  is  mentioned  in  the  next  section  but  both  are 
described fully in Chapters 4 and 5 on learning and recognition.

Once  the  sylpart  level  is  chosen,  the  syllable  level  becomes  a natural  bridge 
between  the  sylpart  level  and  word  level  for  efficient  recognition  in  the  hierarchy-tree 
structure.  It  will  be  shown  in  Section  3.6  that  including  the  syllable  level  in  the 
hierarchy-tree  structure  for  a 12,000-word  vocabulary  reduces  the  estimated 
recognition  costs  to  less  than  one-fourth  the  estimated  cost  incurred  without  the  level, 
at  a slight  increase  in  storage  costs. 


3.4  Auxiliary  Information 

In addition to the knowledge represented in trees as described above, context
information is stored at the leaves of the segment-sylpart trees. Associated with each
sylpart pointed to by each segment pattern stored in the segment-sylpart trees (i.e.,
with each leaf pointed to by each path in a tree) is a list of segment-label pairs. Each
pair gives the left and right context segment-labels of the segment pattern occurring
when the pattern was learned for the particular sylpart. For example, Figure 3.6 shows
three sample lists of segment-label context pairs for the coda "T" and three for the
coda "B". The first context segment-labels learned for the segment pattern [DX] of coda
"T" are "R" on the left and "IH3" on the right. One can detect some dependencies
between the patterns and contexts shown in the figure. Generally, the coda "T"
appears as a [DX] in the context of two vowels (the top-left column), but as the pattern
[<- B] in the context of a vowel and a nasal or a liquid (the top-right column). Also, if
the pattern [<-] is produced by the segmenter and labeler in the context of a vowel and
a fricative, it is more likely that the coda "B" was spoken rather than the coda "T" (the
center two columns). It is these kinds of dependencies which permit the word
hypothesizer to constrain the interpretations of a particular segment pattern. This will
be described in Chapter 5.

Also  stored  with  each  sylpart  (i.e.,  each  leaf)  in  the  segment-sylpart  tree  is  a 
count  of  the  number  of  times  the  segment  pattern  (represented  by  the  path  in  the  tree 
pointing  to  the  leaf)  has  appeared  for  the  particular  sylpart.  This  information  is  used  to 
compute  a weight  penalty,  as  described  in  Chapter  5. 
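
The auxiliary information described in this section can be pictured with a small data-structure sketch. It is only an illustration under stated assumptions, not Noah's actual storage layout: the class names `Node` and `Leaf` and the functions `learn` and `lookup` are hypothetical, and a single tree is used for codas. Each path of segment labels ends at a node holding, for every sylpart learned for that pattern, a count and a list of (left, right) context pairs. The sample entries are taken from the figure for codas T and B.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    """Auxiliary information for one sylpart reached by one segment pattern."""
    count: int = 0                                  # times this pattern was learned for the sylpart
    contexts: list = field(default_factory=list)    # (left_label, right_label) pairs


@dataclass
class Node:
    children: dict = field(default_factory=dict)    # segment label -> Node
    leaves: dict = field(default_factory=dict)      # sylpart name -> Leaf


def learn(root, pattern, sylpart, left, right):
    """Store one training observation of `pattern` for `sylpart` with its context."""
    node = root
    for label in pattern:
        node = node.children.setdefault(label, Node())
    leaf = node.leaves.setdefault(sylpart, Leaf())
    leaf.count += 1
    leaf.contexts.append((left, right))


def lookup(root, pattern):
    """Return the sylparts (with auxiliary information) stored for `pattern`."""
    node = root
    for label in pattern:
        node = node.children.get(label)
        if node is None:
            return {}
    return node.leaves


root = Node()
learn(root, ["DX"], "T", "R", "IH3")
learn(root, ["DX"], "T", "AA5", "ER")
learn(root, ["<-"], "T", "IH3", "OU3")
learn(root, ["<-"], "B", "AA5", "S")

for sylpart, leaf in lookup(root, ["<-"]).items():
    print(sylpart, leaf.count, leaf.contexts)
```

In this sketch the pattern [<-] leads to both coda "T" and coda "B"; the stored context pairs are what would let a recognizer prefer one over the other.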




Coda: T

  Pattern: [DX]        Pattern: [<-]        Pattern: [<- B]
  Context              Context              Context
  Left - Right         Left - Right         Left - Right

  R   - IH3            IH3 - OU3            AA5 - U
  AA5 - ER             IH  - NX             AYX - Ml
  AA5 - IH3            ER2 -                EYR - EL
  ER2 - AYR            EYC - IH2            ER2 - U
  UH4 - AE             IY  - NX
  EH4 - AE2            AYR - 0

Coda: B

  Pattern: [B]         Pattern: [<-]        Pattern: [-]
  Context              Context              Context
  Left - Right         Left - Right         Left - Right

  AA3 - S              AA5 - S              AE3 - S
  AE3 - S              AYC - ZH             OU2 - S
  AO  - EL3            AYR - S              AYR - S
  AO  - IH6            AA4 - EL2            IH  -
  ER  - S              AYC - SH

Figure 3.6: Sample Segment-Label Context for Codas T and B


3.5  Application  of  Storage  and  Recognition  Measures 

Of what value are the various parts of the hierarchy-tree structure for storage
and recognition efficiency in Noah? In an attempt to answer this question we gathered
statistics for the parts of the structure and applied Equations 2.1 through 2.4. The
knowledge acquired by Noah is divided into dictionary knowledge, obtained from a
word-phoneme dictionary and stored in the sylpart-syllable and syllable-word trees,
and segment-label knowledge, acquired from segmented and labeled training utterances
and stored in the segment-sylpart trees (this is explained in Chapter 4). Because of
this division of knowledge, the statistics and the results for the parts of the
hierarchy-tree structure are separated into two groups: one using the knowledge
obtained from the dictionary and one using the knowledge obtained from the training
utterances. Table 3.1A shows the statistics for the dictionary knowledge for three sizes
of vocabularies; Table 3.1B shows them for the segment-label knowledge for 174 training
utterances.

The  measures  of  storage  and  recognition  cost  ratios  for  the  hierarchy  part  and 
the  tree  part  of  the  hierarchy-tree  structure  are  computed  using  these  numbers  and 
the  equations  of  Section  2.3.  For  example,  to  compute  the  ratio  of  the  estimated 
recognition  cost  of  using  a syllable  level  to  the  estimated  recognition  cost  of  not  using  it 
for  the  1000-word  vocabulary  we  obtain  from  Eq.  2.2: 




Words in Vocabulary:                 1011     4020    12049
Unique Syllables:                    1012     2669     4584
Syllables to define Words:           2304    10253    31375
Nodes in Syllable-Word Tree:         1784     7469    20281
Unique Sylparts:                      151      151      151
Sylparts to define Syllables:        2684     7439    12971
Sylparts to define Words:            5426    23618    71987
Nodes in Sylpart-Syllable Tree:      1406     3286     5333

Table 3.1A: Statistics for 1000-, 4000-, and 12,000-Word Vocabularies.


Total Words in Training:             1108
Total Syllable Nuclei:               1653
Unique Sylparts:                      118
Total Sylparts:                      3950
Sylparts to define Syllables:        1261
Unique Segment Labels:                 98
Total Segments:                      7038
Segments to define Sylparts:         3038
Nodes in Segment-Sylpart Trees:      1716

Table 3.1B: Statistics for 174 Training Utterances


$$\text{Hierarchy Recognition Ratio} = \frac{N_{sylpart,syllable} + N_{syllable,word} / L_{sylpart,syllable}}{N_{sylpart,word}}$$

From Table 3.1A we see that $N_{sylpart,syllable}$ = 2684, $N_{syllable,word}$ = 2304,
$L_{sylpart,syllable}$ (the average length of syllables in sylparts) = 2684/1012 = 2.7, and
$N_{sylpart,word}$ = 5426. The recognition cost ratio for the syllable level is about
0.65, which means that the estimated reduction in cost of the recognition algorithm due
to the inclusion of the syllable level in the hierarchy-tree structure is about one-third.
Table 3.2 shows the complete results for the application of the storage and recognition
cost ratios. The results are separated (by a dashed line) into two groups corresponding
to dictionary knowledge (above the line) and segment-label knowledge (below the line).

We can make several observations at this point. First, the values derived from
the dictionaries tend to improve as the vocabulary gets larger. This is expected; as
more words, syllables and sylparts are stored, the chances of common subpatterns
increase. Second, the storage cost for a tree used between two levels is always
greater than that of a simple storage structure. This is because of the need to store pointers



Level     Measure                          Vocabulary:   1000    4000   12,000

Syllable  Hierarchy Storage (Eq. 2.1):                   1.08    1.00    0.91
          Hierarchy Recognition (Eq. 2.2):               0.65    0.47    0.34
          Tree Storage (Eq. 2.3):                        1.60    1.55    1.41
          Tree Recognition (Eq. 2.4):                    0.77    0.73    0.65

Sylpart   Tree Storage:                                  1.28    1.15    1.11
          Tree Recognition:                              0.52    0.44    0.41

- - - - - - - - - - - - - - 174 Training Utterances - - - - - - - - - - - - - -

          Hierarchy Storage:                             0.77
          Hierarchy Recognition:                         0.67

Segment   Tree Storage:                                  1.48
          Tree Recognition:                              0.56

Table 3.2: Results for Storage and Recognition Cost Measures

for the tree. However, we think the savings in recognition costs are worth it. (These
savings are increased when various heuristics for pruning the tree search are used.)
The final observation is that we can obtain a total cost ratio estimate for including a
particular level and storing the units of the level in a tree by multiplying the value of the
storage (or recognition) hierarchy cost ratio for the level by the value of the storage
(or recognition) tree cost ratio for the level. For example, the total storage cost ratio
for including a syllable level for the 12,000-word vocabulary is 1.28 (= 0.91 * 1.41); the
total recognition cost ratio is 0.22 (= 0.34 * 0.65). Thus, the estimated recognition costs
are reduced to less than one-fourth by the use of the syllable level, at a slight increase
in storage.
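
The arithmetic of this section can be checked with a short script. It is a sketch under stated assumptions: the hierarchy recognition ratio is written in the form implied by the worked 1000-word example (Eq. 2.2 itself is defined in Chapter 2), the average syllable length is taken as sylparts-to-define-syllables divided by unique syllables, and the names `TABLE_3_1A`, `hierarchy_recognition_ratio` and `total_cost_ratio` are hypothetical. The counts are copied from Table 3.1A.

```python
# Counts copied from Table 3.1A (1000-, 4000-, and 12,000-word vocabularies).
TABLE_3_1A = {
    1000:  dict(unique_syllables=1012, syllables_in_words=2304,
                sylparts_in_syllables=2684, sylparts_in_words=5426),
    4000:  dict(unique_syllables=2669, syllables_in_words=10253,
                sylparts_in_syllables=7439, sylparts_in_words=23618),
    12000: dict(unique_syllables=4584, syllables_in_words=31375,
                sylparts_in_syllables=12971, sylparts_in_words=71987),
}


def hierarchy_recognition_ratio(row):
    """Estimated cost with a syllable level divided by the cost without it."""
    avg_syllable_length = row["sylparts_in_syllables"] / row["unique_syllables"]
    cost_with_level = (row["sylparts_in_syllables"]
                       + row["syllables_in_words"] / avg_syllable_length)
    return cost_with_level / row["sylparts_in_words"]


def total_cost_ratio(hierarchy_ratio, tree_ratio):
    """Combine the hierarchy and tree ratios for one level by multiplication."""
    return hierarchy_ratio * tree_ratio


for size, row in TABLE_3_1A.items():
    print(size, round(hierarchy_recognition_ratio(row), 3))

# Total ratios for the syllable level, 12,000-word vocabulary (values from Table 3.2).
print(round(total_cost_ratio(0.91, 1.41), 2))   # storage
print(round(total_cost_ratio(0.34, 0.65), 2))   # recognition
```

Under these assumptions the computed hierarchy recognition ratios agree with the corresponding row of Table 3.2 to within about 0.01; small differences come from rounding the average syllable length.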


3.6  Storage  Costs 

Figure 3.7 shows the storage costs for the different vocabularies tested. The
center curve, which includes the storage for the sylpart-syllable tree and for the
nonterminal nodes of the syllable-word tree (i.e., just nodes storing the syllables of the
words), shows a gradual decline in the amount of storage required for each vocabulary
word. This decline reflects the tendency of the trees to share common information.
The storage added onto this curve to give the top curve is almost linear with
vocabulary size. This storage is due to the terminal nodes of the syllable-word tree
(each of which points to a word) and various information stored for each word, such as


[Figure 3.7 plot. Curves, top to bottom: "Storage + Syllable-Word Tree + Sylpart-Syllable Tree";
"Nonterminal Nodes of Syllable-Word Tree + Sylpart-Syllable Tree"; "Sylpart-Syllable Tree".
X-axis: Vocabulary Size (x1000).]

Figure 3.7: Storage Costs versus Vocabulary Size for Dictionary Knowledge


its spelling. (Word spellings use about 27K (19% of 144K) for the 19,000-word
vocabulary, and are only used for analysis output.)

In Figure 3.8, the x-axis gives the number of segments for new sylpart segment-
patterns for various numbers of training utterances. Rather than plot the storage costs
of the segment-label knowledge versus the total number of segments in the training
utterances, we ignored those segment-patterns which were redundant (i.e., had already
been learned) and counted the number of segments in new segment-patterns. The




lowest plot shows how much storage is used for the segment-labels. The center plot
adds to this the terminal nodes of the segment-sylpart tree, i.e., the sylparts. Although
it is not obvious from the plots, nonterminal nodes are added at a slightly faster rate
than terminal nodes. This happens because sylparts often share segment-patterns; once
a segment-pattern is put in the tree, no new nodes will be added for the same pattern,
but a new sylpart (stored in a non-terminal node) may be added for the pattern. The
highest plot shows the inclusion of context storage. It is the context information which
is used to distinguish among sylparts which share segment patterns.


[Figure 3.8 plot. X-axis: Number of Segments in Training for New Sylpart Segment-Patterns
(400 to 3000).]

Figure 3.8: Storage Costs versus Number of Segments
in Training, for Segment-Label Knowledge


3.7  Other  Knowledge  Representations  for  Speech


Many types of knowledge representations have been used or suggested for use
in speech recognition. We will briefly look at the characteristics of four here, each of
which is used to store sequential information. In particular, we are interested in how
each might work for storing knowledge for a word hypothesizer. The four
representations are 1) a tree representation used by the lexical retrieval component of
the HWIM system [Klovstad - 1976], 2) a network used by the Harpy system [Lowerre -
1976], 3) Woods' Augmented Transition Network (ATN) used by the syntax-semantic
parser of the HWIM system [Woods, et al. - 1976], and 4) an Automatically Compilable
Recognition Network (ACORN) [Hayes-Roth & Mostow - 1975] used initially as the
parser of the Hearsay-II system.

Comparison of these four at the level of storage and recognition efficiency is
difficult, if not impossible, since each representation is used within a different
framework and each for a different goal. One dimension on which they can be compared is
the relative amounts of knowledge content and knowledge structure each stores. The
simple storage structure, for example, stores only knowledge content; the structure of
the knowledge is implicit in the assumption that we are storing sequences. At the other
end of the scale is the ACORN representation. In this lattice type representation each
primitive knowledge unit is stored only once. However, pointers and higher level nodes
combine to structure the content into the same sequential information. Figure 3.9
orders several representations on this dimension.


[Figure 3.9 diagram. A scale running from "Maximum Content, Minimum Structure" to
"Minimum Content, Maximum Structure", with the representations Simple-Storage, Tree,
Hierarchy-tree, Transition-Network, Hierarchy, Network, and ACORN marked along it.]

Figure 3.9: Ordering of Representations.

The  ordering  is  by  our  intuition  (somewhat  based  on  experience)  about  how 
these  representations  would  store  word  speech  knowledge. 

3.7.1  Tree

The lexical retrieval component of the HWIM system joins a phonetic level with
the word level by storing word pronunciations in a tree structure. Each word (counting
inflected forms as different words) has on the average more than six pronunciations
stored to account for 1) within-word variations due to palatalization, syllabification,
vowel reduction, and other phonological phenomena, 2) within-word variations due to
the peculiarities of the acoustic-phonetic recognition component and 3) end-of-word
variations due to the effect of preceding and following words. The latter variations,
which account for two-thirds of the extra pronunciations, are constrained during word
recognition if the word context is known.

It is interesting to compute the storage and recognition cost ratios for the
knowledge stored in this tree structure for the HWIM system. A tree of 4371 nodes is
required to store 1960 pronunciations consisting of 10616 phonemes (71 different
phonemes) [Klovstad - 1976]. Using Eq. 2.3 we have:

$$\mathrm{TS} = \frac{4371\,(\log_2 71 + \log_2 4371)}{10616\,\log_2 71} = 1.22$$

for the tree storage ratio, and from Eq. 2.4 we have:

$$\mathrm{TR} = 4371 / 10616 = 0.41$$

for the tree recognition ratio. Thus, for a 22% increase in storage over the simple
storage structure, the tree reduces recognition costs to 41% of the simple recognition
costs^8.
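
The two tree ratios can be reproduced numerically. The sketch below assumes the forms of Eq. 2.3 and Eq. 2.4 as they are instantiated in the HWIM example; the function and parameter names are hypothetical. Besides the HWIM figures, it applies the same formulas to the sylpart-syllable tree of the 1000-word vocabulary (1406 nodes, 2684 sylparts, 151 unique sylparts, from Table 3.1A).

```python
import math


def tree_storage_ratio(tree_nodes, total_units, unique_units):
    """Tree storage divided by simple storage, in the form used for the HWIM example."""
    tree_cost = tree_nodes * (math.log2(unique_units) + math.log2(tree_nodes))
    simple_cost = total_units * math.log2(unique_units)
    return tree_cost / simple_cost


def tree_recognition_ratio(tree_nodes, total_units):
    """Tree recognition cost divided by simple recognition cost."""
    return tree_nodes / total_units


# HWIM lexical retrieval tree [Klovstad - 1976].
hwim_ts = tree_storage_ratio(4371, 10616, 71)
hwim_tr = tree_recognition_ratio(4371, 10616)
print(round(hwim_ts, 2), round(hwim_tr, 2))

# Noah's sylpart-syllable tree, 1000-word vocabulary (Table 3.1A).
noah_ts = tree_storage_ratio(1406, 2684, 151)
noah_tr = tree_recognition_ratio(1406, 2684)
print(round(noah_ts, 2), round(noah_tr, 2))
```

With these inputs the HWIM values come out near 1.22 and 0.41, and the Noah values land close to the Sylpart rows of Table 3.2, which suggests the assumed forms match the ones used in the text.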

It  is  tempting  to  compare  these  ratios  to  the  ratios  of  Table  3.2  for  Noah,  but 
differences  in  the  data  stored  (i.e.,  phonemes  instead  of  sylparts  and  six  pronunciations 
per  word  instead  of  approximately  one  per  word)  make  such  a direct  comparison  hard 
to  evaluate. 

However,  we  can  partially  answer  the  important  question  suggested  by  such  a 
comparison:  What  does  Noah  gain  by  using  a hierarchy-tree  representation  rather  than 
just  a tree  representation?  The  partial  answer  is  found  by  comparing,  for  Noah,  the 
storage  and  recognition  costs  of  a sylpart-word  tree  to  the  storage  and  recognition 
costs of the sylpart-syllable and syllable-word trees combined. For the storage cost
comparison,  the  numerator  of  Eq.  2.3  gives  the  storage  cost  of  each  tree;  the  ratio  of 
the  sum  of  the  storage  costs  for  the  sylpart-syllable  tree  and  the  syllable-word  tree  to 
the  storage  costs  of  a sylpart-word  tree  gives  the  comparison: 

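$$\text{Storage Ratio} = \frac{T_{sylpart,syllable}\,(\log_2 U_{sylpart} + \log_2 T_{sylpart,syllable}) + T_{syllable,word}\,(\log_2 U_{syllable} + \log_2 T_{syllable,word})}{T_{sylpart,word}\,(\log_2 U_{sylpart} + \log_2 T_{sylpart,word})}$$

where $T_{a,b}$ is the number of nodes in the tree joining levels $a$ and $b$, and $U_a$ is
the number of unique units at level $a$.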


8 Of course, these are only estimates, based on an assumed storage implementation and
recognition algorithm (as described in Chapter 2). In particular, the ratio falsely
assumes the ability to use 6.15 bits (= log2 71) for storing each pointer of the tree.




It was found that 8471 nodes were required in a sylpart-word tree for a 4000-word
vocabulary. Using this value and values obtained from Table 3.1A we have:

$$\text{Storage Ratio} = \frac{3286\,(\log_2 151 + \log_2 3286) + 7469\,(\log_2 2669 + \log_2 7469)}{8471\,(\log_2 151 + \log_2 8471)} = 1.4$$

The recognition cost ratio comparison is simply:

$$\text{Recognition Ratio} = \frac{T_{sylpart,syllable} + T_{syllable,word} / L_{sylpart,syllable}}{T_{sylpart,word}} = \frac{3286 + 7469 / 2.3}{8471} = 0.77$$

Thus, a 23% speed up is obtained by adding 40% more storage. Both of these
values would improve for larger vocabularies.
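
The hierarchy-tree versus single-tree comparison for the 4000-word vocabulary can be checked the same way. This is a sketch: `tree_cost` follows the numerator of Eq. 2.3 as it is instantiated in the text, the divisor 2.3 is taken directly from the text's recognition ratio, and all names are hypothetical.

```python
import math


def tree_cost(nodes, unique_units):
    """Numerator of Eq. 2.3 as instantiated in the text for one tree."""
    return nodes * (math.log2(unique_units) + math.log2(nodes))


# 4000-word vocabulary: values from Table 3.1A and the text.
SYLPART_SYLLABLE_NODES = 3286
SYLLABLE_WORD_NODES = 7469
SYLPART_WORD_NODES = 8471
UNIQUE_SYLPARTS = 151
UNIQUE_SYLLABLES = 2669
LENGTH_DIVISOR = 2.3   # divisor appearing in the text's recognition ratio

storage_ratio = ((tree_cost(SYLPART_SYLLABLE_NODES, UNIQUE_SYLPARTS)
                  + tree_cost(SYLLABLE_WORD_NODES, UNIQUE_SYLLABLES))
                 / tree_cost(SYLPART_WORD_NODES, UNIQUE_SYLPARTS))

recognition_ratio = ((SYLPART_SYLLABLE_NODES + SYLLABLE_WORD_NODES / LENGTH_DIVISOR)
                     / SYLPART_WORD_NODES)

print(round(storage_ratio, 2), round(recognition_ratio, 2))
```

The storage ratio comes out slightly above 1.4 and the recognition ratio near 0.77, consistent with the roughly 40% storage increase and 23% speed up quoted above.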

The lexical retrieval component of HWIM also stores auxiliary information in
order to constrain the search of the tree when searching for words of a particular
length or words of a particular syntactic-semantic category. For example, the syntax
and semantics component of the system can ask the lexical retrieval component to find
all verbs matching well between two time periods in the utterance. A tree structure
also exists for doing a backward search in the utterance. In this tree, the phonemes of
the final parts of word pronunciations share common nodes. This permits finding all
words which match the phones of an utterance occurring before a particular part of the
utterance. Thus, the system can query the lexical retrieval component for all words
ending at a particular time in the utterance.

3.7.2  Network

A tree combines common initial parts of sequences, but a network combines all
possible parts of sequences within the constraint of preserving the uniqueness of each
sequence. A tree "remembers" the past sequence whereas a network "forgets" what
symbols have been traced on a particular path. It is this "forgetting" which makes a
network less suitable for word hypothesization. The Harpy speech system uses a
network representation to store all possible segment-label sequences which could
appear for all possible sentences generated by its grammar. Since the system needs to
find only the best path through this network, it is economical to trace forward through
the network to follow the best paths and then to backtrace to find the best one. A
word hypothesizer might have to do an expensive backtrace to find the best N paths if
it used this representation.
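
The difference between "remembering" and "forgetting" can be seen on a toy example. The sketch below is an illustration only and is not Harpy's network construction: it builds a tree (common initial parts shared) from three hypothetical label sequences, then forms a network by merging identical subtrees, which is one simple way to share common final parts while keeping the same set of sequences. All names are hypothetical.

```python
def build_trie(sequences):
    """Build a tree in which sequences share their common initial parts."""
    root = {"end": False, "next": {}}
    for seq in sequences:
        node = root
        for symbol in seq:
            node = node["next"].setdefault(symbol, {"end": False, "next": {}})
        node["end"] = True
    return root


def count_trie_nodes(node):
    """Count all nodes of the tree, including the root."""
    return 1 + sum(count_trie_nodes(child) for child in node["next"].values())


def count_network_states(root):
    """Count the states left after merging identical subtrees of the tree."""
    registry = {}

    def state_id(node):
        # Two nodes are merged when they accept exactly the same continuations.
        signature = (node["end"],
                     tuple(sorted((sym, state_id(child))
                                  for sym, child in node["next"].items())))
        return registry.setdefault(signature, len(registry))

    state_id(root)
    return len(registry)


words = [("T", "EH", "L"), ("B", "EH", "L"), ("S", "EH", "L")]
trie = build_trie(words)
print(count_trie_nodes(trie), count_network_states(trie))
```

In the tree, the node reached after "T" is distinct from the node reached after "B", so a path end identifies the whole sequence. In the merged network, both labels lead to the same state, so the identity of the first symbol must be recovered by a backtrace.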




3.7.3  Transition  Network

Harpy's recognition network is made by merging a grammar network (in which
each node is a word) with the phonetic networks describing the possible patterns of
each word. Before the merge is done we have a structure similar to a two-level
transition network. The difference is that the nodes of the grammar network are
expanded to the lower level networks rather than the transitions between the nodes.
In either case these structures are related to networks in the same way the hierarchy
structure is related to the simple storage structure. The levels of the more complex
structures permit the sharing of similar parts of the simpler structures. Whether or
not the more complex structures reduce storage space or recognition time depends on
the knowledge being stored. In the case of Harpy's network it may be that fewer nodes
and pointers are required (for the grammars and vocabularies involved) when the two
levels are merged into one network and reduced by removing null states and redundant
states and by doing "subsumption" of common states^9.

An augmented transition network has been used as the parser in the HWIM
speech system. The term "augmented" refers to associating an action or a set of
actions with an arc between two states of the network [Woods - 1970]. These actions
are to be executed whenever a path traced by the parser uses the arc. Actions are
used to build up a semantic interpretation of the parse, to store current conditions of
the parse, or to check various conditions necessary for continuing the parse. An arc
can be one of five types for the HWIM system, corresponding to the type of action
stored on it. The five types are: PUSH, POP, WRD, CAT and JUMP. The PUSH arc
represents a nonterminal of the grammar and signals the parser to enter a lower level
network describing the nonterminal. The POP arc signals the return from the lower
level network to the original level. The WRD and CAT arcs match words and syntactic-
semantic categories of words, respectively. Thus, it is these arcs which match the
words found by the lexical retrieval component. The JUMP arc permits the parser to
jump between certain states without "consuming" any of the utterance. For example,
two states having an "adjective" arc (which would be a CAT type of arc) between them
could also have a JUMP arc between them indicating that the adjective is optional
between those states according to the grammar.
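
The five arc types can be illustrated with a very small recognizer. This is a sketch, not the HWIM parser: the grammar, the lexicon, and every name below are hypothetical, and the "augmenting" actions (building interpretations, testing conditions) are left out so that only the arc types themselves are shown. The noun-phrase network contains the optional-adjective JUMP arc described above.

```python
# Hypothetical lexicon: word -> syntactic category.
LEXICON = {"the": "det", "big": "adj", "dog": "noun", "barks": "verb"}

# network name -> state -> list of (arc type, label, next state)
NETWORKS = {
    "S": {
        "s0": [("PUSH", "NP", "s1")],
        "s1": [("CAT", "verb", "s2")],
        "s2": [("POP", None, None)],
    },
    "NP": {
        "n0": [("WRD", "the", "n1")],
        "n1": [("CAT", "adj", "n2"), ("JUMP", None, "n2")],   # adjective is optional
        "n2": [("CAT", "noun", "n3")],
        "n3": [("POP", None, None)],
    },
}
START = {"S": "s0", "NP": "n0"}


def parse(net, state, words, pos):
    """Yield every input position at which `net` can POP when started at `state`."""
    for kind, label, nxt in NETWORKS[net][state]:
        if kind == "POP":
            yield pos
        elif kind == "JUMP":                      # move on without consuming a word
            yield from parse(net, nxt, words, pos)
        elif kind == "WRD":                       # match one specific word
            if pos < len(words) and words[pos] == label:
                yield from parse(net, nxt, words, pos + 1)
        elif kind == "CAT":                       # match any word of a category
            if pos < len(words) and LEXICON.get(words[pos]) == label:
                yield from parse(net, nxt, words, pos + 1)
        elif kind == "PUSH":                      # enter a lower level network
            for end in parse(label, START[label], words, pos):
                yield from parse(net, nxt, words, end)


def accepts(sentence):
    """True if the whole sentence is covered by a path through network S."""
    words = sentence.split()
    return any(end == len(words) for end in parse("S", START["S"], words, 0))


print(accepts("the big dog barks"), accepts("the dog barks"), accepts("dog barks"))
```

The recognizer explores paths depth-first and reports only whether some complete path exists, which mirrors the orientation toward finding a parse rather than collecting many competing hypotheses.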

The disadvantage of a transition network representation for word
hypothesization is similar to the disadvantage of a network representation. A transition
network is oriented towards finding the best path (i.e., best parse, in the above
example) and not many best paths as would be needed for a word hypothesizer.

9 From conversation with Lowerre. See [Lowerre - 1976] for details of this network
reduction.




However, it might be possible to use an augmented transition network whose "actions"
explicitly save the best word hypotheses.

3.7.4  ACORN

At  the  extreme  structure  end  of  the  content-structure  dimension  we  have  the 
recognition  network  called  an  ACORN  (for  Automatically  Compiled  Recognition  Network) 
[Hayes-Roth & Mostow - 1975], which has been suggested for speech and vision
recognition  and  for  a time  served  as  the  base  of  the  syntactic  parser  of  the  Hearsay-II 
speech  system.  As  a parser,  the  recognition  network  was  automatically  compiled  from  a 
description  of  the  grammar  with  the  goal  of  "maximally  exploiting  repeated  subparts  of 
the  grammar"  - i.e.,  the  number  of  nodes  in  the  network  was  minimized.  Thus  we  have 
the  case  of  minimizing  content  storage  without  regard  to  structural  storage  cost. 

As  a parser  of  the  Hearsay-II  system,  the  ACORN  representation  had  unique 
terminal  nodes  corresponding  to  each  word  in  the  vocabulary.  Nonterminal  nodes 
corresponded  to  phrases  (i.e.,  sequences  of  words).  In  this  lattice  type  representation, 
a node  X is  linked  to  any  and  all  (conceptually)  higher  nodes  which  contain  the  word  or 
phrase  associated  with  X as  a subpart  (a  "conjunctive"  grouping  at  the  higher  node)  or 
as  a member  of  a class  of  phrases  or  words  associated  with  the  higher  node  (a 
"disjunctive"  grouping  at  the  higher  node).  For  example,  there  is  a unique  terminal  node 
associated  with  the  word  "Me".  This  node  is  linked  to  nonterminal  nodes  such  as  those 
representing  "Give-Me"  and  "Tell-Me"  as  part  of  a conjunctive  grouping.  The  node 
might  also  be  linked  to  a nonterminal  node  such  as  "Personal-Pronoun"  which  has  a 
disjunctive grouping. (E.g., "Us" and "I" might also be linked disjunctively to
"Personal-Pronoun".)

During  recognition,  a word  hypothesis  triggers  the  unique  terminal  node 
corresponding  to  it.  This  constitutes  a match  of  the  node  with  the  speech  input.  This 
node  passes  the  positional  and  rating  information  of  the  word  along  its  links  to  higher 
nonterminal  nodes.  A nonterminal  node  with  a disjunctive  grouping  of  lower  nodes 
passes  the  information  on  up  the  network,  but  a node  with  a conjunctive  grouping  of 
lower  nodes  is  only  partially  matched  until  all  of  its  lower  nodes  "report"  with  words  or 
phrases  which  1)  form  an  adjacent  sequence  and  2)  pass  a combined  rating  test.  When 
these  conditions  are  met,  the  node  passes  the  new  information  along  its  links  to  yet 
higher  nodes.  This  process  continues  until  some  top  level  node  signals  that  a best 
complete  parse  has  been  found. 

Ideally this recognition process gives a bottom-up, non-backtracking parser
capable of starting its work at any and all places in the utterance. However, in practice,
a combinatorial explosion occurs due to 1) the errorful nature of word hypotheses (not
all of the correct hypotheses are made and many incorrect ones are made) and 2) the
number of partially matched nodes made for each match of a node at a lower level. For
example, the word "Me" may be hypothesized many times incorrectly. Each time it is
hypothesized, all nodes in the representation containing "Me" as a subpart are partially
matched. Since not all correct word hypotheses are present, partially matched nodes
must be used to predict the missing words. Thus, "Me" will be used to predict "Give"
and "Tell" (from the example above). The number of partially matched nodes and the
number of resulting predictions swamp the system [Hayes-Roth, Mostow and Fox -
1977]^10.

It  seems  that  a word  hypothesizer  using  errorful  segment-label  hypotheses 
would  have  a similar  combinatorial  explosion  using  an  ACORN  representation. 


10  Part  of  the  solution  to  this  problem  for  Hearsay-II  was  first  to  pass  the  word 
hypotheses  through  a simple  parser  that  found  sequences  of  words  which  were 
pairwise  grammatically  adjacent.  The  ACORN  algorithm  took  each  sequence,  verified 
that  it  was  grammatical  (or  that  there  was  a usable  subpart  of  it  that  was 
grammatical)  and  continued  searching  the  network.  Essentially,  the  first  pass  parser 
permitted  the  ACORN  algorithm  to  begin  at  higher  level  nodes  in  the  representation. 




Chapter  4:  Acquisition  of  Knowledge


4.1  Introduction 

In this chapter we address the important issue of acquiring knowledge for the
Noah word hypothesizer. Knowledge acquisition is a central problem for AI
knowledge-based systems [Feigenbaum-77]. Difficulty in acquiring knowledge can prevent a
knowledge-based system (no matter how ingenious) from advancing from "toy" problems
to real-world problems. Special attention was given to the design of Noah so that it
could acquire the knowledge necessary for a "real-world" vocabulary.

Let us consider three possible methods of acquiring knowledge. In general,
knowledge-based systems can acquire their knowledge by 1) the manual method, in
which someone looks at the problem confronting the system and puts in the needed
knowledge, usually at a very detailed level; 2) the rule method, in which the required
knowledge has been reduced to a set of rules which are encoded for the system; and 3)
the automatic learning method, in which the knowledge is learned from training samples.
A system might use more than one method. Below we discuss how these methods appear for
speech systems.

In  the  manual  method,  one  looks  at  examples  of  the  input  in  order  to  hand-tune 
word  templates  (or  whatever  the  unit  of  the  pattern  match  is  chosen  to  be).  After  the 
initial  tuning,  this  method  involves  a cycle  of  attempted  recognition  on  new  input, 
investigation  of  errors,  and  readjustment  of  templates.  The  method  is  potentially  very 
accurate. However, it is time-consuming work, it may lack generality, and it does not
permit  changes  to  the  segmenter-labeler  without  starting  from  scratch.  The  Harpy 
speech  system  [Lowerre-1976]  uses  primarily  this  method. 

In the rule method, much of the speech knowledge is encoded in the form of
phonetic rules which attempt to account for the differences between the ideal word
pronunciations and observed speech. These rules can be applied either in a top-down
mode by modifying the dictionary word pronunciations (as is done in the HWIM speech
system) or in a bottom-up mode by modifying the output of the segmenter-labeler. (An
early version of Hearsay-II attempted this [CMU Computer Science Speech Group -
1976].) The problem with using the rules in a bottom-up mode is that the rules are
based on a generative model of speech which describes the transformations from words
to the speech signal. Unfortunately these transformations are not easily or uniquely
reversible. The main problem for either the bottom-up or top-down modes is knowing
when to apply the rules. "Every rule has its exceptions" is very true of linguistic rules.
Variables, such as intonation and rate of speech, often modify them.

The  automatic  learning  method  requires  1)  a framework  which  is  able  to  contain 
the  necessary  speech  knowledge  and  2)  preclassified  samples  used  to  put  the 
knowledge  into  the  framework,  i.e.,  to  train  the  system.  The  difficulties  are  designing 
the  framework  (knowing  what  type  of  knowledge  will  be  needed)  and  getting  sufficient 
amounts  of  correctly  classified  training  data.  If  the  framework  is  too  general,  the 
sample  size  will  be  too  small;  if  it  is  too  narrow,  it  may  never  contain  enough  knowledge 
to  recognize  words  accurately. 

Noah  has  aspects  of  all  three  acquisition  methods.  The  manual  method  is  used  to 
acquire  base  pronunciations  of  words.  This  knowledge  is  dictionary  knowledge;  it  is 
constant  for  a given  vocabulary;  it  gives  the  patterns  of  words  in  terms  of  sylparts;  and 
it  is  considered  a priori  speech  knowledge.  The  rule  method  is  used  to  modify  some  of 
the  base  pronunciations  to  account  for  schwa  deletions  (discussed  below).  The 
automatic  learning  method  is  used  to  acquire  the  patterns  for  sylparts  in  terms  of 
segment label patterns. This knowledge characterizes the particular segmenter-labeler
used  by  the  word  hypothesizer  and  is  considered  learned  speech  knowledge.  We  next 
describe  the  acquisition  of  dictionary  knowledge  along  with  the  application  of  the  schwa 
deletion  rules.  This  is  followed  by  a description  of  the  acquisition  of  segment  label 
knowledge  (Section  4.3). 


4.2  Dictionary Knowledge

One goal for Noah was the ability to acquire and store the knowledge for a very
large number of words easily. This is made possible by using a standard pronunciation
dictionary as the source of knowledge. A computer-readable 20,000-word
pronunciation dictionary was used¹. Parts of this dictionary (every Nth word) were combined
with the 1000-word Hearsay-II dictionary (containing the words of the test sentences)
to obtain various sizes of test vocabularies.

Only one type of pronunciation variation was added to the base pronunciations
found in the dictionary. Variations due to vowel deletions are added by rule as the
dictionary is processed². For example, the second schwa in the word "summary" (S AX
M - AX - R IY) is commonly deleted so that the word is often pronounced S AX M - R IY.
Since the schwa is not in the speech, the only way that a bottom-up syllable-based
word hypothesizer can recognize the word is to store the two-syllable pronunciation.

1 We used the pronunciations from "The New Merriam-Webster Pocket Dictionary - 1964",
received from Richard Goldhor and Jon Allen at MIT, as produced by John Olney and
Donald Ramsey (described in [Olney & Ramsey - 1972]).

Knowing when to delete a schwa is not easy. Cole, who has summarized and
indexed the phonological rules of the ARPA speech community [Cole - 1974], concludes
his summary of 16 vowel deletion rules with the statement: "it seems that the process
of vowel deletion is a rather complex one, not yet well understood". One problem is
that the deletion rules summarized by Cole go beyond reasonable schwa deletions, as
found in the words "summary", "reference" (R EH F - (AX -) R AX N S), and "cardinal" (K
AA R D - (AX -) N EL), to words which must be considered exceptions to the rules, such
as "agony" (AE G - (AX -) N IY) and "element" (EH L - (AX -) M AX N T). A second
problem is that schwa deletion depends on the rate and manner of speech. For
example, deleting the schwa in the word "karate", giving (K R AA T - IY), might occur in
fast speech, but in carefully articulated speech it would be considered mispronounced.
Such is not the case with the word "summary".


Sample Rule 1: Delete AX in left context of <stressed syllable, final B>,
and right context of <unstressed syllable, initial R>.

    Elaborate       IH - 1L AE B - (AX -) R AX T
    Laboratory      1L AE B - (AX -) R AX - 2T OW - IY
    Robbery         1R AA B - (AX -) R IY

Sample Rule 2: Delete AX in left context of <stressed syllable, final K>,
and right context of <unstressed syllable, initial L>.

    Broccoli        1B R AA K - (AX -) L IY
    Chocolate       1T SH AA K - (AX -) L AX T
    Stickler        1S T IH K - (AX -) L ER

Sample Rule 3: Delete AX in left context of <stressed syllable> and initial D,
and right context of <unstressed syllable, initial R>. Merge D with R.

    Boundary        1B AW N - D (AX -) R IY
    Mandarin        1M AE N - D (AX -) R AX N

Examples of words in which schwas were not deleted:

    Academy         AX - 1K AE D - AX - M IY
    Widower         1W IY D - AX - W ER
    Liberation      2L IH B - AX - 1R EY - SH AX N
    Illiterate      IH L - 1L IH T - AX - R AX T
    Ritual          1R IH T SH - AX - W EL
    Vocalize        1V OW - K AX - 2L AY Z
    Parliament      1P AA R - L AX - M AX N T

Figure 4.1: Sample Schwa Deletion Rules and Examples.

2 The  pronunciation  format  of  the  words,  as  described  in  Chapter  2,  permits  specifying 
any  type  of  pronunciation  variation.  However,  this  feature  was  not  used. 


In deciding what schwas to delete, we chose to limit the deletions to what might
occur in carefully articulated speech. Using the rules collected by Cole as a reference,
we studied³ the contexts in which schwas occurred in the 20,000-word dictionary. This
resulted in the set of rules given in Appendix C. Figure 4.1 gives three examples of the
schwa deletion rules with examples of their application. In addition, the figure gives
examples of words from the dictionary which contain schwas which were not deleted.

As the pronunciation of each word is read from the dictionary and added to the
syllable-word tree, alternate pronunciations are added according to the schwa deletion
rules. These rules expand the number of pronunciations by only 2.5%. Since these
rules do not apply often to the dictionary and since the test utterances were made up
of carefully articulated speech, the rules have little effect on the performance of Noah.
The rules applied to only 10 instances of the 1160 words in the training utterances
(discussed in the next section), and to only 3 instances of the 705 words in the test
utterances. Although the schwa deletion rules could be removed with little effect, it
was not clear beforehand what their effect would be. The more common occurrence of
merging a schwa with a nearby vowel (so that the schwa and the vowel share a syllable
nucleus) is handled by vowel sequence learning, described in Section 4.3.3.3.
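The way a context rule of this kind produces an alternate pronunciation can be sketched as follows. This is only an illustration in Python, not the original implementation: the list-of-syllables encoding, the leading "1"/"2" stress digits, and the function names are assumptions made for the example, and only the simple form of Sample Rules 1 and 2 (no merging of phones, as in Sample Rule 3) is covered.

```python
def is_stressed(syllable):
    # Assumed convention: a leading "1" or "2" on the first phone marks stress.
    return syllable[0][0] in "12"


def bare(phone):
    # Strip the assumed stress digit from a phone name.
    return phone.lstrip("12")


def delete_schwa(syllables, left_final, right_initial):
    """Return alternate pronunciations in which a syllable consisting only of
    AX is dropped, when it lies between a stressed syllable ending in
    left_final and an unstressed syllable beginning with right_initial."""
    alternates = []
    for i in range(1, len(syllables) - 1):
        left, mid, right = syllables[i - 1], syllables[i], syllables[i + 1]
        if (mid == ["AX"]
                and is_stressed(left) and bare(left[-1]) == left_final
                and not is_stressed(right) and bare(right[0]) == right_initial):
            alternates.append(syllables[:i] + syllables[i + 1:])
    return alternates


# "Robbery" matches Sample Rule 1 (final B, initial R); "Academy" does not.
robbery = [["1R", "AA", "B"], ["AX"], ["R", "IY"]]
academy = [["AX"], ["1K", "AE", "D"], ["AX"], ["M", "IY"]]
```

Each alternate returned would be added to the syllable-word tree alongside the base pronunciation, so that both forms can be recognized.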


4.3  Segment-Label  Knowledge 

4.3.1  Hand  Segmentation 

Segment patterns for sylparts are obtained by hand-segmenting speech into
sylpart-sized sections⁴. It should be noted that hand-segmentation of speech into
sylpart-sized sections is not person independent. Occasionally people will disagree
about where a sylpart boundary occurs or whether a "vowel sequence" (mentioned
below) should be formed. However, the important thing for pattern recognition is
consistency in hand-segmentation. Since one person (the author) did all of the
hand-segmentation, some degree of consistency was obtained.

3 It should be noted that very little real speech was looked at (only the 174
training utterances) in refining the rules. The decision of whether to delete the
schwa in a certain context was based on the subjective test of whether the words
having a schwa in that context "sounded" right without the schwa.

4 This potentially time-consuming work was simplified by using an interactive program
which displays the speech waveform for each utterance on a graphics terminal and
shows the segmentation, best label choice, and segment class labels of the current
segmenter-labeler on the waveform. The program then queries the user word-by-word
for the correct begin and end times of each sylpart. The output of the program
is a file which, when used with the output of a segmenter-labeler, defines the
segment patterns of each sylpart.

4.3.2  Segment  Pattern  Learning 

Learning a segment pattern for a sylpart involves storing the sequence of
top-rated segment labels in the appropriate segment-sylpart tree (onset, vowel, or coda)⁵.
This  simple  scheme  is  modified  by  two  methods  of  merging  similar  patterns  in  order  to 
save  storage  space.  Figure  4.2  shows  the  top  five  segment-labels  for  the  utterance 
"Please  Help  Me".  The  number  in  parentheses  before  each  label  is  the  rating  of  the 
label,  generated  by  the  labeler  and  normalized  by  subtracting  off  the  best  rating;  the 
lower  the  number  the  better  the  rating.  The  class-label  is  a broad  characterization  of 
the  segment.  Its  use  to  Noah  will  be  described  later  (Section  4.3.3.1). 


SEGMENT  SEGMENT    CLASS-
NO.      TIME       LABEL   TOP FIVE SEGMENT-LABELS AND RATINGS
                            1         2         3         4         5

 1       (48:64)    SIL     ( 0) -    (16) TH   (25) F    (30) ?    (37) Z
 2       (64:78)    nF2     ( 0) PL   ( 7) T    (14) S    (15) TH   (18) -
 3       (78:79)    VIY     ( 0) IH6  ( 2) IH4  ( 7) IH7  ( 9) IH3  (14) IH2
 4       (79:89)    NGL     ( 0) Y    (12) ER3  (15) G    (17) NX   (19) IX
 5       (89:93)    FRV     ( 0) S    (19) TH   (19) PL   (20) ZH   (25) T
 6       (93:96)    HUF     ( 0) S    (27) ZH   (36) SH   (40) PL   (45) TH
 7       (96:101)   HUF     ( 0) T    ( 5) D    ( 6) TH   ( 7) K    ( 7) F
 8       (101:107)  HHV     ( 0) HH   (12) V    (13) DH   (16) IH7  (17) P
 9       (107:114)  VLU     ( 0) AWC  ( 1) AYL  ( 7) AO   (13) AA3  (18) AA5
10       (114:119)  TCN     ( 0) EL2  ( 9) EL3  (11) EL   (20) L    (23) L2
11       (119:130)  SIL     ( 0) -    (19) TH   (25) F    (26) ?    (38) Z
12       (130:135)  NAS     ( 0) M1   ( 6) M    ( 9) UW   (18) EH   (21) AA4
13       (135:144)  VIY     ( 0) IH5  ( 3) IH2  ( 6) IH3  ( 6) AYR  ( 7) IH
14       (144:153)  HHY     ( 0) IY   ( 4) Y    ( 5) IH5  ( 7) IH7  ( 8) IH3
15       (153:168)  LVF     ( 0) G    ( 1) D    ( 1) HH2  ( 6) TH   ( 7) DH

Figure 4.2: Partial Segment-Label Output for "Please help me".


5 At one time, segment duration information was learned as part of the pattern. This
was discontinued when it was found that the duration information did not improve the
recognition results. It seems that segment duration information is very dependent on
stress, rate of speech, intonation, and other factors above the sylpart level which we
were not prepared to include.

Consider the third sylpart of the utterance, the coda "Z" of the syllable of
"Please", between the horizontal lines: these lines represent the times given by the
hand-segmentation. The times correspond to a top-rated segment label pattern of "[S S
T]" (segments 5 through 7) for the coda "Z". However, the pattern is stored as "[S T]",
which, during recognition, will match any number of S's followed by any number of T's.
This compression of identical adjacent labels is the first method of pattern merging. The
second is to use a non-top choice label if 1) its rating is below a set threshold and 2)
the label is the best choice (according to the label ratings of the current segment) of all
previously learned patterns which match the new segment pattern up to the current
segment. This is done as follows: a pattern is added to the tree segment by segment,
each segment label becoming a node. Before a segment label is added, all following
nodes are checked to find the best matching segment-label (previously learned) to the
current segment. If the rating of the best one is below (i.e., better than) the set
threshold, its node is taken as part of the current pattern. If none of the labels has a
good enough rating, or there are no next nodes, the top-rated label of the current
segment is inserted as a new node. For example, suppose the segment pattern "[IH4]"
has been learned for the vowel "IY", but the pattern "[IH6]" has not been learned. The
learning program would interpret the segment pattern for the vowel "IY" of "Please"
(segment number 3) as having already been learned, if the threshold is greater than 2
(which is the rating of "[IH4]" in segment 3). The only change in the system would be in
the count of the number of times the "[IH4]" pattern had been seen for the vowel "IY".
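The two merging methods can be illustrated with a small sketch. It is written in Python for exposition only; the Node class, the dictionary-of-ratings input, and the rule for combining the two methods in one loop are assumptions of the example rather than a description of the original program.

```python
class Node:
    def __init__(self, label=None):
        self.label = label      # segment label at this node (None for the root)
        self.children = {}      # label -> Node
        self.count = 0          # number of times a pattern ending here was seen


def learn(root, segments, threshold):
    """Add one sylpart pattern to a segment-sylpart tree.

    segments is a list of dicts mapping label -> normalized rating (0 is best).
    A previously learned next node is reused when its label rates below the
    threshold in the current segment; otherwise the top-rated label is added.
    Identical adjacent labels are compressed into a single node."""
    node = root
    for ratings in segments:
        known = [lab for lab in node.children
                 if lab in ratings and ratings[lab] < threshold]
        if known:
            label = min(known, key=lambda lab: ratings[lab])
        else:
            label = min(ratings, key=ratings.get)
        if label == node.label:
            continue            # same label as the previous segment: compress
        node = node.children.setdefault(label, Node(label))
    node.count += 1
    return node


vowel_tree = Node()
learn(vowel_tree, [{"IH4": 0}], threshold=3)
# Segment 3 of "Please": IH6 is top-rated, but the learned IH4 rates 2 (< 3).
learn(vowel_tree, [{"IH6": 0, "IH4": 2, "IH7": 7, "IH3": 9, "IH2": 14}], threshold=3)

coda_tree = Node()
# Segments 5 through 7 of "Please": top labels S, S, T are stored as [S T].
learn(coda_tree, [{"S": 0, "TH": 19}, {"S": 0, "ZH": 27}, {"T": 0, "D": 5}], threshold=3)
```

With a threshold of 3 the second call only increments the count on the existing IH4 node; with a threshold of 2 or less it would add a new IH6 node instead.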

4.3.3  Segment  Pattern  Learning  for  Vowels 

In  addition  to  the  above  features,  there  are  three  others  for  vowels:  syllable 
nuclei,  nonsequential  pattern  storage,  and  vowel  sequences.  To  discuss  these  features 
we will refer to Figure 4.3, showing the top segment-label choices for the utterance "Do
any  papers  cite  Nilsson?"  divided  into  sylpart  sections  defined  by  the  hand- 
segmentation. 

4.3.3.1  Syllable Nuclei

As a syllable-based recognition method, Noah needs to find where the syllables
are. This is done by locating syllable nuclei. It was found experimentally that the
segment class labels generated by the segmenter-labeler provided a fairly reliable
indication of syllable nuclei. Segment class labels are a by-product of the segmenter: in
order for the segmenter to divide the acoustic description of speech into similar parts, it
must analyze and characterize the speech (usually in 10 ms samples). A segment class
label is simply this characterization for a complete segment. (Appendix B gives a list of
the segment class labels.) Any sequence of (one or more) class labels whose names
begin with a "V" (which stands for "vowel") implies a syllable nucleus. For example, the
class label of segment number 11 indicates the syllable nucleus of the first syllable of
the word "Papers".
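The nucleus rule amounts to finding maximal runs of class labels that begin with "V". A minimal sketch in Python follows; the function name and the use of 0-based list positions (rather than the figure's segment numbers) are assumptions of the example.

```python
def syllable_nuclei(class_labels):
    """Return (first, last) index pairs, 0-based and inclusive, for every
    maximal run of segment class labels whose names begin with 'V'."""
    nuclei = []
    start = None
    for i, label in enumerate(class_labels):
        if label.startswith("V"):
            if start is None:
                start = i
        elif start is not None:
            nuclei.append((start, i - 1))
            start = None
    if start is not None:
        nuclei.append((start, len(class_labels) - 1))
    return nuclei


# Class labels of segments 3 through 7 of Figure 4.3 ("Do any"):
do_any = ["MRS", "VCN", "VFT", "DCN", "VFT"]
# Class labels of segments 10 through 12 (first vowel of "papers"):
papers = ["TCN", "VFT", "TCN"]
```

For the first list the function reports two nuclei, one spanning the two adjacent "V" segments and one for the final segment; for the second it reports the single middle segment.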

4.3.3.2  Nonsequential  Pattern  Storage 


Segment  Time         Segment      Top            Hand-Segmentation
No.      (Cent-sec.)  Class-Label  Segment-Label  Sylpart     Code  Times

 1       (51:63)      SIL          -              D  Onset          (61:64)
 2       (63:64)      RSH          PH
 3       (64:69)      MRS          UW             UW Vowel    V     (64:73)
 4       (69:76)      VCN          ER2
 5       (76:79)      VFT          EH4            EH Vowel    V     (73:79)
 6       (79:82)      DCN          M1             N  Coda           (79:82)
 7       (82:88)      VFT          IY             IY Vowel          (82:88)
 8       (88:98)      SIL          -              P  Onset          (88:100)
 9       (98:100)     RSP          P
10       (100:104)    TCN          EYL            EY Vowel          (100:112)
11       (104:109)    VFT          EYC
12       (109:112)    TCN          IY3
13       (112:119)    SIL          -              P  Onset          (112:120)
14       (119:120)    RSP          V
15       (120:126)    VNO          ER             ER Vowel          (120:129)
16       (126:129)    HHV          IH2
17       (129:137)    HUF          S              Z  Coda           (129:137)
18       (137:141)    HUF          S              S  Onset          (137:141)
19       (141:152)    VRR          AYL            AY Vowel          (141:154)
20       (152:155)    DCN          IH2
21       (155:161)    LVF          NX             T  Coda           (154:158)
22       (161:165)    NGL          NX             N  Onset          (158:165)
23       (165:174)    TCN          IH3            IH Vowel    C     (165:174)
24       (174:183)    VBK          EL3            L  Coda           (174:183)
25       (183:189)    HUF          S              S  Onset          (183:195)
26       (189:195)    HUF          S
27       (195:198)    VNO          AH             AX Vowel          (195:201)
28       (198:202)    VCN          L2
29       (202:206)    LVF          NX             N  Coda           (201:213)
30       (206:213)    LVF

Figure 4.3: Top Segment Labels and Hand-Segmentation for
"Do any papers cite Nilsson?"



segment  patterns,  Noah  learns  the  patterns  beginning  with  the  syllable  nucleus. 
Consider  again  the  first  vowel  of  "papers",  "EY",  which  spans  segments  10  through  12. 
Segment  11  is  put  first  in  the  tree,  then  segment  10  (and  then  earlier  segments  if  they 
are  part  of  the  vowel),  and  finally  segment  12  (and  any  later  segments).  Thus,  the 
pattern  will  appear  in  the  segment-vowel  tree  as  a path  through  the  nodes  marked 
"EYC",  "EYL",  "IY3". 

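The storage order for the "EY" example can be sketched as a small function. This is an illustrative Python sketch; the function name is hypothetical, and the left-to-right order used inside a nucleus of more than one segment is an assumption, since the text gives only a single-segment nucleus as its example.

```python
def storage_order(first, last, nucleus_first, nucleus_last):
    """Order in which the segments first..last of a vowel are entered in the
    segment-vowel tree: the syllable nucleus, then the earlier segments
    (moving back in time), then the later segments."""
    order = list(range(nucleus_first, nucleus_last + 1))
    order += range(nucleus_first - 1, first - 1, -1)
    order += range(nucleus_last + 1, last + 1)
    return order


# Top labels of segments 10 through 12 in Figure 4.3.
top_label = {10: "EYL", 11: "EYC", 12: "IY3"}
path = [top_label[s] for s in storage_order(10, 12, 11, 11)]
```

For the vowel spanning segments 10 through 12 with its nucleus at segment 11, the order is 11, 10, 12, which gives the path "EYC", "EYL", "IY3" quoted above.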
4.3.3.3  Vowel  Sequence  Learning 

Often vowels run together in speech, sharing the same syllable nucleus. At
other times the segment pattern of a vowel is greatly modified by the preceding onset
or the following coda. Sometimes this modification of the vowel pattern is the only clue
to the identity of the onset or coda. It is for these reasons that a method of "vowel
sequence learning" was developed. Basically, the method permits forming a "vowel
sequence" by concatenating a vowel with another vowel, with the last phoneme of an
onset, or with the first phoneme of a coda whenever the speech waveform gives no
evidence of distinct sylparts. The segment pattern spanned by this vowel sequence is
learned⁶. For example, the first two vowels of "Do any" (segments 3, 4, and 5 in Figure
4.3) share the same syllable nucleus (segments 4 and 5)⁷. The hand-segmentation code
of "V" (for Vowel concatenation) in the "Code" column after each vowel indicates that
these vowels should be joined and learned as a vowel sequence. The label UW-EH is
added to the vowel sequence lexicon (if it is not already there) and the segment pattern
from segments 3 through 5 is added to the segment-vowel tree. Recognition of this
pattern identifies the vowel sequence UW-EH, which is expanded and treated as if two
separate vowels had been recognized.

Another example is found in segments 23 and 24. Here the "L" of "Nilsson" has
become a syllabic /L/ with the vowel appearing as a "tail consonant" (TCN) before it.
The "C" (for Coda concatenation) in the hand-segmentation tells the learning program to
form the vowel sequence IH-L and learn segments 23 and 24 as its pattern. More
complex vowel sequences occur. A common one is IY-R-IY as found in the word
"Theory". In hand-segmenting 1350 syllable nuclei, 235 (17%) were found to be vowel
sequences. These included 94 different vowel sequences, one-third of which were
liquid-vowel or vowel-liquid pairs. (Appendix B lists the vowel sequences found for the 174
training utterances.)
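The bookkeeping for vowel sequences (a lexicon entry added on first sight, and expansion into parts at recognition time) might be sketched as below. The dictionary representation and the function names are assumptions made for this Python illustration.

```python
vowel_sequence_lexicon = {}


def learn_vowel_sequence(parts):
    """Add the sequence to the lexicon if it is not already there and
    return its label, e.g. ["UW", "EH"] -> "UW-EH"."""
    name = "-".join(parts)
    vowel_sequence_lexicon.setdefault(name, list(parts))
    return name


def expand(name):
    """Expand a recognized vowel sequence into its parts; a plain vowel
    expands to itself."""
    return vowel_sequence_lexicon.get(name, [name])


learn_vowel_sequence(["UW", "EH"])       # "Do any", segments 3 through 5
learn_vowel_sequence(["IH", "L"])        # "Nilsson", segments 23 and 24
learn_vowel_sequence(["IY", "R", "IY"])  # "Theory"
```

After recognition of the UW-EH pattern, `expand("UW-EH")` yields the two vowels UW and EH, which are then treated as if they had been recognized separately.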

Vowel  sequence  learning  is  the  first  method  for  handling  coarticulation  problems 
at  the  sylpart  level;  the  second  is  context  learning. 

6 During recognition, a match of the segment pattern with the new segments identifies
the same vowel sequence, which is then expanded into its parts.

7 As  mentioned  above,  any  sequence  of  "V"  class  segments  defines  a nucleus. 


4.3.4  Context  Learning 




When learning the segment pattern for a sylpart, selected context about the
pattern is also learned. Though this context could be of several types (e.g., sylparts to
the left and right in the utterance, stress of surrounding syllables, relative amplitude of the
sylpart, or position of the sylpart in the utterance, to account for end-of-utterance
effects⁸), we have chosen to learn the one-segment pattern to the left and to the right
of the sylpart's segment pattern. These adjacent segments of a segment pattern give
the most relevant information about why the segment pattern for the sylpart appears as
it does. Other factors, such as the types of contexts suggested above, influence the
pattern, but the greatest influence is given by the speech immediately before and after
the pattern. During recognition this context is used to limit the possible interpretations
of a segment pattern.

In Figure 4.3, the context for the first vowel of "papers" (segments 10, 11 and
12) is segments 9 and 13. Just as in learning sylpart segment patterns, the top
segment labels of segments 9 and 13 are not necessarily chosen for the context. If
some other labels had already been learned for the same sylpart and segment pattern,
and both labels have ratings (in their respective segments) better than a set threshold⁹,
no new context will be learned. Rather, the count of the number of times the old
context appeared will be incremented.

4.3.5  Hand-made  Segment  Patterns 

The manual method of knowledge acquisition was used for some of the less
frequently occurring onsets and codas. Of the 83 onsets and 131 codas occurring in the
20,000-word dictionary, 24 onsets and 58 codas did not occur at all in our training and
test sentences (i.e., in the sentences made from the 1000-word dictionary). Since these
onsets and codas make up only [illegible] and 1.2% of all onset and coda occurrences,
respectively, in the 20,000-word dictionary, we chose to eliminate them from the
lexicons. Any word using one of these sylparts is not used in any of the test
vocabularies. Of the remaining onsets and codas, 14 onsets and 26 codas did not occur,
or occurred infrequently, in the training sentences. For these sylparts, we entered
hand-made patterns into the segment-sylpart trees. A hand-made pattern for a sylpart
was made by combining and modifying the patterns stored for similar sylparts. For
example, the pattern for the onset "S P L" was made to be "[S - PL]". "[S]" is the most
common pattern for onset "S" and "[- PL]" is the most common pattern for the onset "P
L". (Appendix B lists the sylparts used by Noah.)

8 We have experimented some with the last two, but these contexts are not used
currently.

9 Since the identity of a sylpart depends less on the context of its segment pattern
than on the segment pattern, this threshold is less restrictive than the threshold
used for the segment labels in the segment patterns.




Chapter 5: Recognition


5.1  Introduction 

Recognition is a bottom-up process through four levels for the Noah word
hypothesizer: 1) syllable nuclei are recognized at the segment level; 2) vowels, onsets,
and codas are hypothesized and rated at the sylpart level, based on segment labels; 3)
syllables are hypothesized and rated at the syllable level, based on the sylpart
hypotheses; and 4) words are hypothesized at the word level, based on the syllable
hypotheses. This chapter describes the major steps in this recognition process and
explains how the ratings are computed for the hypotheses at each level.

Before  the  recognition  algorithm  is  discussed,  it  is  necessary  to  distinguish 
between  "lexical  items"  and  "hypotheses":  Each  level  has  a lexicon  associated  with  it. 
For  example,  the  syllable  level  has  a lexicon  of  syllables  made  up  of  all  syllables  which 
occur  in  any  vocabulary  word.  The  sylpart  level  has  a lexicon,  which  for  convenience, 
has  been  divided  into  the  onset,  vowel,  and  coda  lexicons.  A lexical  item  is  one  entry  in 
a lexicon.  In  contrast,  a hypothesis  is  a suggested  realization  of  a lexical  item  in  the 
speech  utterance.  It  has  a beginning  and  ending  time  (or  segment  number  for  this 
system),  a lexical  index  which  points  to  the  lexical  item  it  represents  and  a rating, 
measuring the likelihood that the lexical item occurs at the given place in the utterance.
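The distinction can be made concrete with a small data structure. The following Python sketch is illustrative only: the field names, the three-entry lexicon, and the segment numbers and ratings of the two sample hypotheses are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    lexical_index: int  # points to the lexical item in the level's lexicon
    begin: int          # beginning segment number
    end: int            # ending segment number
    rating: int         # lower numbers indicate better ratings


# A lexicon is a list of lexical items; one lexicon per level.
syllable_lexicon = ["M IY", "Y UW Z", "AX N"]


def name_of(hyp, lexicon):
    """Look up the lexical item that a hypothesis represents."""
    return lexicon[hyp.lexical_index]


# Two distinct hypotheses for the same lexical item at different places
# (segment numbers and ratings here are hypothetical).
h1 = Hypothesis(lexical_index=1, begin=21, end=25, rating=19)
h2 = Hypothesis(lexical_index=1, begin=22, end=25, rating=24)
```

One lexical item may therefore give rise to many hypotheses, each with its own place in the utterance and its own rating.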

5.2  The  Recognition  Algorithm 

5.2.1  Information  Needed  for  Recognition 

Hypothesis X at level i is based on a sequence of adjacent hypotheses at the
next lower level, level i-1, as explained in Chapter 2. Three types of information are
combined to produce hypothesis X: 1) the sequence of lexical items at level i-1 that
define hypothesis X, 2) the hypotheses at level i-1 that have been produced previously,
and 3) the adjacency information of hypotheses at level i-1. Each of these types of
information will now be described.

The sequence information is acquired during training of the system and is stored
in trees between adjacent levels. Chapter 4 discussed the acquisition of this
information and Chapter 3 gave examples of its storage in trees.

Information  about  what  hypotheses  exist  at  level  i-1  is  generated  and  stored  as 
recognition  proceeds  up  through  each  level.  Figure  5.1  shows  sample  hypotheses  at 
each of the four levels for the last syllable in the utterance "I'd like to see the menus".
The  time  in  centi-seconds  measured  from  the  beginning  of  the  utterance  is  shown  for 
each  segment  on  the  bottom  line.  Immediately  above  the  time  is  the  sequential  number 
of  the  segment.  Displayed  with  each  hypothesis  is  its  lexical  name,  its  rating  in 
parentheses  (lower  numbers  indicate  better  ratings)  and,  except  for  the  segment 
hypotheses,  its  beginning  and  ending  segments  (indicated  by  a time  line).  Although  only 
the  top-five  segment  labels  are  shown,  the  segmenter-labeler  produces  the  ratings  for 
each  of  the  98  labels  in  the  segment  lexicon  for  each  segment. 

Three  things  should  be  noted  concerning  the  hypotheses  at  the  sylpart  level:  1) 
One  type  of  sylpart  (vowel,  onset,  or  coda)  can  overlap  with  other  types.  In  other 
words,  a syllable  region  is  not  divided  into  regions  of  onsets,  vowels,  and  codas;  the 
position  of  a sylpart  is  based  on  the  match  of  its  stored  segment  pattern  with  the 
segment  label  hypotheses.  2)  A null  onset  hypothesis  (with  a duration  of  zero)  exists  at 
each  segment  position  in  which  a vowel  hypothesis  begins  and  a null  coda  hypothesis 
exists  at  each  segment  position  in  which  a vowel  hypothesis  ends.  None  of  these 
hypotheses is displayed here. 3) The hypotheses "Y UW" and "IY AX" are examples of
"vowel  sequences".  They  will  be  discussed  later  (Section  5.2.3.2). 

Two hypotheses at the same level are adjacent if one hypothesis ends at
segment number N and the other begins at segment number N+1. This makes storing
adjacency information simple: a list pointing to all hypotheses of the same level
beginning at segment number N+1 gives the set of all hypotheses adjacent to any
hypothesis of that level ending at segment number N. (These adjacency lists are not
necessary at the segment level, where adjacency information is implicit in the storage
method, or at the word level, where adjacency information is not needed by the word
hypothesizer.) This simple definition of adjacency requires a new hypothesis for every
place the pattern of a lexical item matches the lower level. For example, the vowel
sequence "Y UW" appears four times, spanning different segments. However, though the
alternative method of storing begin-time and end-time "fuzziness" for each hypothesis
saves storage, it does not permit rating each time span separately, and it needs
sophisticated adjacency tests between hypotheses. (This is the method used by the
Hearsay II system.)
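The adjacency lists can be sketched as a table indexed by beginning segment number. This is a Python illustration under stated assumptions: hypotheses are written as (name, begin, end, rating) tuples, the function names are hypothetical, and the sample spans are invented for the example rather than read from Figure 5.1.

```python
from collections import defaultdict


def build_adjacency(hypotheses):
    """Index the hypotheses of one level by their beginning segment number."""
    by_begin = defaultdict(list)
    for hyp in hypotheses:
        by_begin[hyp[1]].append(hyp)
    return by_begin


def adjacent_to(hyp, by_begin):
    """All hypotheses beginning at N+1, where N is the end segment of hyp."""
    return by_begin.get(hyp[2] + 1, [])


# (name, begin segment, end segment, rating); spans are illustrative.
sylparts = [
    ("M", 20, 20, 13),
    ("IY", 21, 23, 18),
    ("Y UW", 21, 24, 13),
    ("Z", 25, 25, 12),
]
lists = build_adjacency(sylparts)
```

Finding everything adjacent to a hypothesis is then a single lookup, with no comparison of time ranges.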

5.2.2  One  Step  in  Recognition 

WORD LEVEL
    MENUS(0)   ME(12)   USE(19)   AN(23)   NEWS(33)

SYLLABLE LEVEL
    M IY(12)   Y UW Z(19)   AX N(23)   IH N(25)   M Y UW(27)   N Y UW Z(33)

SYLLABLE-PART LEVEL
    Onsets:  Y(13)   M(13)   N(13)   DH(29)   M Y(17)
    Vowels:  IH(13)   Y UW(13)   Y UW(17)   IY AX(18)   Y UW(18)   Y UW(18)   UW(21)
    Codas:   V(15)   Z(12)   N(26)   N(23)   V Z(26)

SEGMENT LEVEL
    Top five labels:     NX(0)    Y(0)     IY(0)    UW3(0)   M1(0)    D(0)     S(0)
                         M(8)     IY(11)   Y(13)    Y(2)     UW(1)    Z(12)    SH(47)
                         N(8)     IH5(29)  EYC(17)  ER3(2)   M(5)     DH(16)   ZH(50)
                         M1(10)   G(32)    IY2(24)  IH2(4)   IH7(5)   TH(17)   T(50)
                         EM(10)   IH3(34)  ER3(26)  IH5(7)   EM(6)    V(18)    PH(53)
    Segment class:       NAS      FRV      VSW      VCN      NVF      LUF      FRU
    Segment no.:         20       21       22       23       24       25       26
    Time (centi-secs):   117      125      130      133      139      146      152

Figure 5.1: Sample Hypotheses at Each Level
for the Last Syllable of "MENUS"

The sequences of adjacent hypotheses at level i-1 are compared to the
sequences of lexical items represented by the nodes of the tree between levels i-1 and
i to produce the hypotheses at level i. Assume that the sequence of m (m ≥ 1)
hypotheses [h_1, h_2, ..., h_m] at level i-1 has been matched with m nodes
[n_1, n_2, ..., n_m] in the tree. A match between hypothesis h_j and node n_j simply
means that they have the same lexical name. The next step is to compare all sons of
node n_m with all hypotheses adjacent to hypothesis h_m¹. If a match is found between
a son n_{m+1} and some adjacent hypothesis h_{m+1}, then node n_{m+1} (representing a
unique path in the tree) is saved to be extended later. If the sequence of lexical items
given by the path defines a lexical item at level i, a hypothesis is made at level i with
begin- and end-segment numbers and rating obtained from the hypotheses
[h_1, h_2, ..., h_m, h_{m+1}]. Since different paths may
define the same lexical item or different sequences of hypotheses may match the same
path in the tree, duplicate hypotheses may be generated, differing only in rating. All
duplicate hypotheses except the best-rated hypothesis are deleted.
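One step of this kind can be sketched in Python. The sketch makes several assumptions that are not taken from the text: hypotheses are (name, begin, end, rating) tuples, tree nodes are dictionaries with "sons" and "item" fields, paths are extended strictly left to right, and the rating of a new hypothesis is supplied by a caller-chosen `combine` function (the actual rating computation is described separately).

```python
from collections import defaultdict


def add_item(root, names, item):
    """Store the lower-level name sequence for a level-i lexical item."""
    node = root
    for name in names:
        node = node["sons"].setdefault(name, {"sons": {}, "item": None})
    node["item"] = item


def recognize(root, hyps, combine):
    """Match sequences of adjacent lower-level hypotheses against tree paths
    and return the level-i hypotheses, keeping only the best-rated duplicate."""
    by_begin = defaultdict(list)
    for h in hyps:
        by_begin[h[1]].append(h)
    best = {}
    stack = [(root["sons"][h[0]], [h]) for h in hyps if h[0] in root["sons"]]
    while stack:
        node, path = stack.pop()
        if node["item"] is not None:
            key = (node["item"], path[0][1], path[-1][2])
            rating = combine([h[3] for h in path])
            if key not in best or rating < best[key]:
                best[key] = rating
        # Compare the sons of the current node with the adjacent hypotheses.
        for h in by_begin.get(path[-1][2] + 1, []):
            son = node["sons"].get(h[0])
            if son is not None:
                stack.append((son, path + [h]))
    return sorted((item, b, e, r) for (item, b, e), r in best.items())


tree = {"sons": {}, "item": None}
add_item(tree, ["M", "IY"], "M IY")
sylpart_hyps = [("M", 20, 20, 13), ("M", 20, 20, 30), ("IY", 21, 23, 11)]
syllables = recognize(tree, sylpart_hyps, combine=sum)
```

Here the two "M" hypotheses both lead to a syllable "M IY" over segments 20 through 23; only the better-rated duplicate survives.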

5.2.3  Features  of  Recognition  Unique  to  the  Lower  Levels 

5.2.3.1  Segment Level to Sylpart Level

There are three trees between the segment level and the sylpart level: the
segment-vowel tree, the segment-onset tree, and the segment-coda tree. Recognition
begins with the segment-vowel tree, starting at the first segment of a syllable nucleus
(segment 22 in Figure 5.1)². The segment label hypotheses at this segment are matched
with the first nodes of the tree. Since the segment patterns in the segment-vowel tree
are not necessarily stored in a left-to-right sequence (see Section 4.3.3.2 on
"nonsequential pattern storage"), the next nodes in the tree determine whether the
segments to the right or left are compared next. Thus, the order in which the segments
are compared is part of the pattern stored in the tree. A vowel hypothesized will have
a beginning segment number equal to the leftmost segment compared and an ending
segment number equal to the rightmost segment compared for its pattern.

Any segment immediately to the left of the first segment of a vowel hypothesis
becomes a starting segment for searching the segment-onset tree. For this tree,
segments are compared right-to-left (i.e., back in time) as the tree is searched.
Segments for the segment-coda tree are compared in the forward direction, starting
with any segment immediately to the right of the last segment of a vowel hypothesis.

1 To  avoid  p times  q comparisons,  the  sons  of  each  node  and  the  hypotheses  on  each 
list  of  adjacent  hypotheses  are  ordered  by  the  lexical  number  of  their  lexical  items. 
This  permits  at  most  p-t-q  comparisons. 

2 As described in Chapter 4, a syllable nucleus is a sequence of contiguous segments
each of which has a segment class name beginning with a "V".




Unique to these segment-sylpart trees is the ability of one node in a tree to
span more than one segment (see Section 4.3.2 on "segment pattern learning"). For
example, a vowel with the segment pattern "[Y]" may be hypothesized to span segments
21 through 23 since a "Y" segment label is present in each segment with a good (i.e.,
numerically low) rating. This form of a dynamic programming algorithm permits Noah to
use nonsegmented speech. Suppose the speech corresponding to segments 21 through
23 of the example had been divided into 10-millisecond samples and then labeled. If
the "Y" label was rated well in each sample (as it was in each segment) the same vowel
would be hypothesized to span the same part of the utterance. However, segmenting
reduces the cost of recognition. In this case, only 3 segments are looked at rather than
14 (10-millisecond) samples (i.e., the time span between centiseconds 125 and 139).
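To make the multi-segment node concrete, the following Python sketch lets a single label extend over consecutive segments as long as it remains well rated. It is a simplification under stated assumptions: the data layout, the threshold, and the ratings are hypothetical, and list indices 0 through 2 stand in for segments 21 through 23.

```python
def spans_for_label(segments, start, label, threshold):
    """Return every (begin, end) span starting at `start` in which
    `label` is rated at or below `threshold` in each segment.

    `segments` is a list of dicts mapping a segment label to its
    rating (lower is better).
    """
    spans = []
    end = start
    while end < len(segments):
        rating = segments[end].get(label)
        if rating is None or rating > threshold:
            break  # the label is absent or rated too poorly here
        spans.append((start, end))
        end += 1
    return spans


# Hypothetical ratings for four consecutive segments.
segments = [
    {"Y": 3, "IY": 5},
    {"Y": 2, "IH": 7},
    {"Y": 4, "UW": 6},
    {"N": 1},
]
y_spans = spans_for_label(segments, 0, "Y", threshold=5)
```

The same function would work unchanged on 10-millisecond samples; only the length of the `segments` list, and therefore the cost, would grow.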

5.2.3.2  Sylpart  Level  to  Syllable  Level 

The search of the sylpart-syllable tree begins at the vowels in each syllable
region, continues with the onsets, and finishes with the codas. This is done in order to
handle null onsets and null codas easily.

Vowel sequences are expanded during this search and treated as a separate
sequence of sylparts. For example, the vowel sequence "IY R IY" is expanded into its
parts. When the first "IY" begins the sylpart-syllable tree search, the "R" is used as an
optional coda, the only other option being a null coda. When the second "IY" is used
to start the sylpart-syllable tree search, the "R" is used as an optional onset, the
other option being a null onset. Any initial non-vowel in a vowel sequence (such as the
"Y" of "Y UW") is appended to the end of adjacent onsets whenever such a joining
results in a legal onset name. For example, the "Y" in the vowel sequence hypothesis "Y
UW" spanning segments 21 through 23 is appended to the onsets ending at segment 20,
producing onsets "M Y" and "N Y", both legal onsets. Similarly, any final non-vowel in a
vowel sequence is appended to the beginning of adjacent codas if it results in a legal
coda name. Vowel sequences with more than one vowel produce syllables with special
adjacency restrictions.
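The joining of an initial non-vowel to adjacent onsets can be sketched as below. This is an illustrative Python fragment only; the function name and the small set of legal onset names are assumptions, not the onset inventory Noah learned.

```python
def extend_onsets(onsets, non_vowel, legal_onsets):
    """Append `non_vowel` to each onset, keeping only legal onset names."""
    extended = []
    for onset in onsets:
        candidate = f"{onset} {non_vowel}"
        if candidate in legal_onsets:
            extended.append(candidate)
    return extended


# Hypothetical legal onset names and onsets ending at segment 20.
legal_onsets = {"M", "N", "S T", "M Y", "N Y", "K Y"}
onsets_ending_at_20 = ["M", "N", "S T"]
joined = extend_onsets(onsets_ending_at_20, "Y", legal_onsets)
```

With these assumed inputs the onsets "M Y" and "N Y" are produced, while "S T Y" is discarded because it is not in the assumed legal set. The mirror-image operation for codas would prepend the final non-vowel instead.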

Consider the vowel sequence "IY AX". This vowel sequence produces the set of
syllables characterized by the sequence: <some onset><the vowel "IY"><a null coda>
and the set of syllables characterized by <a null onset><the vowel "AX"><some coda>.
Syllables from the first set are adjacent on the right only to syllables in the second set.
The syllable "M IY" in Figure 5.1 is adjacent on its right only to the syllable "AX N"
since both are based on the vowel sequence "IY AX". (Their times overlap because each
uses the complete segment span of the vowel sequence).


5.2.4 Parallel Recognition

Although  a parallel  recognition  algorithm  has  not  been  implemented,  the 
recognition  algorithm  given  above  permits  a high  degree  of  parallel  processing. 
Matching  the  sons  of  a node  in  a tree  with  a set  of  hypotheses  depends  only  on  having 
matched  the  previous  nodes  in  the  tree  and  having  the  complete  set  of  hypotheses. 
Thus,  processes  can  work  at  the  same  time  in  different  branches  of  the  same  tree  in  the 
same  place  in  the  utterance,  or  in  the  same  tree  in  different  parts  of  the  utterance,  or 
in  different  trees  at  different  levels.  One  possible  division  of  processing  is  to  use  one 
processor  for  each  syllable  region.  In  each  region  a processor  would  first  hypothesize 
vowels,  onsets,  and  codas,  then  hypothesize  syllables,  and  finally  search  the  syllable- 
word  tree  starting  with  all  syllables  in  its  region  and  continuing  with  other  regions  as 
the  syllables  became  available.  Depending  on  the  number  of  processors,  this  scheme 
would  decrease  the  recognition  time  for  the  typical  utterance  by  more  than  an  order  of 
magnitude. 


5.3  Rating  of  Hypotheses 

There  is  no  end  to  changing  rating  methods,  adjusting  thresholds,  and  generally 
trying  to  get  optimal  results  from  errorful  input.  The  rating  methods  reported  here 
have  no  proof  of  optimality  but  give  the  best  results  so  far  and  seem  to  make  sense. 
Ratings  are  used  by  the  word  hypothesizer  to  report  the  likelihood  that  a word  was 
spoken  at  a particular  place  in  the  utterance  and  to  limit  the  search  of  trees  at  each 
level.  As  has  been  stated,  the  rating  of  each  hypothesis  ranges  from  0 to  some  upper 
limit, with 0 corresponding to a perfect score (similar to the negation of the log of
likelihood  probabilities).  The  rating  of  a hypothesis  made  up  of  a sequence  of 
hypotheses  at  the  next  lower  level  is  equal  to  the  sum  of  the  ratings  of  the  lower 
hypotheses  minus  a normalizing  value  depending  on  the  length  of  the  sequence.  This 
sum  is  computed  as  the  path  through  the  tree  is  searched.  If  at  any  node  in  the  search 
this  sum  minus  any  possible  future  normalizing  value  becomes  greater  than  a set 
threshold  (different  for  each  tree),  the  search  of  the  tree  at  the  node  and  beyond  is 
pruned.  Thus,  searching  for  hypotheses  that  will  be  rated  poorly  is  aborted  early. 
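The accumulation and pruning rule above can be sketched in Python as follows. The thesis does not give the normalizing function or the thresholds at this point, so `normalizer`, `max_future_norm`, and all numeric values below are hypothetical.

```python
def rate_path(ratings, normalizer, max_future_norm, threshold):
    """Accumulate ratings along one tree path.

    ratings         -- ratings of the lower-level hypotheses matched so far
                       (0 is a perfect score)
    normalizer      -- assumed function of sequence length giving the value
                       subtracted from the final sum
    max_future_norm -- largest normalizing value any completion of the
                       path could still receive
    Returns the final rating, or None if the path is pruned.
    """
    total = 0
    for r in ratings:
        total += r
        # Even the most generous future normalization cannot bring this
        # path back under the threshold, so stop searching here.
        if total - max_future_norm > threshold:
            return None
    return total - normalizer(len(ratings))


def normalizer(length):
    """Hypothetical length normalization: 2 per item after the first."""
    return 2 * (length - 1)


kept = rate_path([3, 4, 5], normalizer, max_future_norm=6, threshold=10)
pruned = rate_path([9, 9], normalizer, max_future_norm=6, threshold=10)
```

Because ratings are non-negative and the final normalizing value never exceeds `max_future_norm`, a path pruned by this test could not have finished at or below the threshold.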

In the case of a sylpart, the final rating of the hypothesis also includes a context
rating and a weight penalty. As described in Chapter 4, a set of context segment label
pairs is stored in the tree for each segment pattern of each sylpart. These context
segment label pairs are the top segment label hypotheses^3 found to the left and to the
right of the segment pattern each time the pattern was learned for the sylpart. The
context rating is computed during recognition by finding the stored pair of context
segment labels which match best with the segments to the left and right of the place
where the pattern is currently matched. The sum of the ratings of these best segment
labels is divided by a factor based on the length of the segment pattern and then added
to the rating of the sylpart hypothesis. The reasoning behind this computation is that
the closer the current context for the segment pattern of a sylpart matches with some
previously learned context, the more likely the segment pattern should be interpreted
as representing the same sylpart. Also, the shorter the segment pattern, the more
influence the context segments will have on it. In the case of a zero-length segment
pattern, the rating of the context segment labels completely determines the rating of
the hypothesis. An example of this is the onset "D" in the context of a nasal and a
vowel as might occur in the word "Standing" (S T AE N - D IH NX).

3 Subject to the modification described in Section 4.3.4 on "context learning".
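A minimal Python sketch of the context rating follows. The thesis states only that the sum is divided by "a factor based on the length of the segment pattern"; the factor `1 + pattern_len`, the `missing` default rating, and the sample labels and ratings are all assumptions made for illustration.

```python
def context_rating(stored_pairs, left_seg, right_seg, pattern_len, missing=100):
    """Rate how well the current context matches any learned context.

    stored_pairs -- (left_label, right_label) pairs learned for a pattern
    left_seg     -- label -> rating for the segment left of the match
    right_seg    -- label -> rating for the segment right of the match
    pattern_len  -- number of segments in the matched pattern
    missing      -- assumed poor rating for a label absent from a segment
    """
    best = min(
        left_seg.get(left, missing) + right_seg.get(right, missing)
        for left, right in stored_pairs
    )
    # Assumed length factor: a longer pattern dilutes the context's effect.
    return best / (1 + pattern_len)


# Hypothetical learned contexts and current segment ratings (lower is better).
pairs = [("N", "IH"), ("M", "IY")]
left = {"N": 2, "M": 6}
right = {"IH": 4, "IY": 3}
zero_length = context_rating(pairs, left, right, pattern_len=0)
two_segments = context_rating(pairs, left, right, pattern_len=2)
```

With a zero-length pattern the sylpart's own pattern rating is 0, so under this sketch the context rating alone determines the rating of the hypothesis, as in the "D" of "Standing".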


Segment patterns:     [- PH]    [4-]    [SH]    [- P]    ...    Row Max

Onset sylpart:  P        2        0       0       19     ...      19
                T        1        1       2        0     ...      10
                K       10        0       0        1     ...      10
                D Y      1        2       0        0     ...       2
                SH       0        0      16        0     ...      16
                ...     ...      ...     ...      ...

Column Sum:             16       23      34       22

Figure  5.2:  Sample  Frequency  Counts  of  Segment  Patterns  for  Onsets 

The weight penalty penalizes a sylpart hypothesis when a) it is based on a
segment pattern which occurs much more frequently for other sylparts, and b) its
empirically observed conditional probability is low for the pattern. Thus, the weight
penalty attempts to distinguish between sylparts having the same segment pattern.
Figure 5.2 shows a portion of a training frequency count matrix for the segment
patterns of onsets. For instance, one can tell from this figure that the segment pattern
"[- PH]" appeared twice for the onset "P" in the training. The numbers in the rightmost
column give the number of times the most common segment pattern appeared for the
onset of the row. The bottom row shows the total number of times each segment
pattern appeared. Let $R_i$ be the row maximum for the sylpart $S_i$, $C_j$ be the column sum
for the segment pattern $P_j$, and $F_{ij}$ be the frequency count of pattern $P_j$ for sylpart $S_i$.
The weight penalty, $W_{ij}$, for $S_i$ and $P_j$ is obtained by:

$$W_{ij} = \langle\text{maximum weight penalty}\rangle \times \left(1 - \max\left(\frac{F_{ij}}{C_j},\ \frac{F_{ij}}{R_i}\right)\right)$$

The penalty $W_{ij}$ is based on the empirically observed Prob($S_i$|$P_j$) = $F_{ij}/C_j$, limited
by the ratio of $F_{ij}$ to $R_i$. For example, Prob("D Y"|[4-]) = 2/23, but since "D Y" occurs
infrequently in training, its most commonly occurring pattern occurs only twice. Thus, the
pattern "[4-]" receives no weight penalty for the onset "D Y". On the other hand,
consider the pattern "[- PH]" for the onset "P". The Prob("P"|[- PH]) = 2/16 is small and
the ratio of the occurrences of "[- PH]" for "P" to the maximum occurrences of any
pattern for "P" is even smaller (= 2/19) so that the weight penalty is high. It is not
very likely that pattern "[- PH]" will represent onset "P".
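The two worked cases can be checked with a short Python sketch of the weight penalty, reading the garbled formula as one minus the larger of the two ratios, which is the reading consistent with both examples. The counts come from Figure 5.2; the maximum weight penalty of 8 is a hypothetical value, since the thesis does not state it here.

```python
def weight_penalty(f_ij, c_j, r_i, max_penalty):
    """Weight penalty for sylpart i and segment pattern j.

    f_ij        -- times pattern j was seen for sylpart i in training
    c_j         -- column sum: times pattern j was seen for any sylpart
    r_i         -- row maximum: count of sylpart i's most common pattern
    max_penalty -- assumed maximum weight penalty
    """
    return max_penalty * (1 - max(f_ij / c_j, f_ij / r_i))


MAX_PENALTY = 8  # hypothetical value

# Onset "D Y" with the second pattern of Figure 5.2: F = 2, C = 23, R = 2.
dy_penalty = weight_penalty(2, 23, 2, MAX_PENALTY)

# Onset "P" with pattern "[- PH]": F = 2, C = 16, R = 19.
p_penalty = weight_penalty(2, 16, 19, MAX_PENALTY)
```

Under these assumptions "D Y" receives no penalty because the pattern is its most common one, while "P" receives seven-eighths of the maximum.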


Rating method:            Segment          Including     Including Context
                          pattern match    Context       and Weight Penalty

Vowel results:
  % correct               81%              81%           79%
  Average rank            5.1              3.1           2.7
  # hypotheses/syl.       22.7             19.5          13.3

Onset results:
  % correct               93%              93%           91%
  Average rank            6.6              4.3           3.7
  # hypotheses/syl.       19.0             16.6          13.1

Coda results:
  % correct               90%              90%           90%
  Average rank            5.8              3.9           3.9
  # hypotheses/syl.       16.8             15.0          12.8

Figure  5.3:  Sylpart  Recognition  Results  for  215  Syllables 


Figure 5.3 shows the effect of including the context rating and the weight
penalty with the rating of the segment pattern match for the recognition of sylparts for
215 syllables. The rank of a correct hypothesis is the number of incorrect competing
hypotheses rated better than it, plus one. This rank value is averaged for all correct
hypotheses for each type of sylpart to give the average rank. Adding the context rating
and adding the weight penalty each give the desired decrease in the average rank and
eliminate many incorrect hypotheses. However, the weight penalty increased
the rating for a few of the correct vowel hypotheses and onset hypotheses above the
acceptance threshold so that they were eliminated. The addition of context decreases
the average rank for the sylparts by about 35%. The addition of the weight penalty
decreases the average rank another 10%.
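The two percentages can be reproduced approximately from the average ranks in Figure 5.3. The Python sketch below averages the relative decrease over the three sylpart types; this averaging method is an assumption, since the thesis does not say how the figures of about 35% and 10% were obtained.

```python
# Average rank per sylpart type from Figure 5.3, in the order:
# (pattern match only, including context, including context and weight penalty)
AVERAGE_RANK = {
    "vowel": (5.1, 3.1, 2.7),
    "onset": (6.6, 4.3, 3.7),
    "coda": (5.8, 3.9, 3.9),
}


def mean_relative_decrease(before, after):
    """Mean over sylpart types of the relative drop in average rank."""
    drops = [
        (ranks[before] - ranks[after]) / ranks[before]
        for ranks in AVERAGE_RANK.values()
    ]
    return sum(drops) / len(drops)


context_gain = mean_relative_decrease(0, 1)  # about 0.36
penalty_gain = mean_relative_decrease(1, 2)  # about 0.09
```

Note that the coda rank is unchanged by the weight penalty (3.9 in both columns), so the further decrease comes entirely from the vowels and onsets.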



5.4  Propagation  of  Segment  Label  Confusion  During  Recognition 

A frequent lament of speech system designers is: "If only we had a better
segmenter-labeler". Although such a desire is almost equivalent to a wish for cleaner
and  more  easily  recognized  speech,  it  is  clear  that  the  word  hypothesizer  would 
certainly  profit  from  a better  segmenter-labeler.  The  effect  of  segment-label  confusion 
on  the  word  hypothesizer  is  shown  here  by  using  the  measure  of  hypothesis  confusion 
developed  in  Section  2.4.  Note  that  this  confusion  measure  says  nothing  about  correct 
or  incorrect  hypotheses;  it  only  measures  the  confusion  (i.e.,  the  uncertainty  of 
information, in a nontechnical sense) of a set of hypotheses (segment labels, sylparts,
syllables,  or  words)  based  on  the  past  performance  of  the  segmenter-labeler.  As  is 
pointed  out  in  Chapter  2,  the  confusion  measure  can  be  interpreted  as  measuring,  for  a 
set  of  competing  hypotheses,  the  equivalent  number  of  hypotheses  rated  the  same  as 
the  best  hypothesis.  Figure  5.4  traces  the  confusion  of  the  hypotheses  from  the 
segment  level  through  14  steps  of  the  recognition  algorithm  up  through  the  word  level. 
The  first  number  for  each  set  of  competing  hypotheses  gives  the  average  confusion 
measured  for  the  set  at  the  particular  point  in  the  recognition  algorithm.  The  number  in 
parentheses  gives  the  average  number  of  hypotheses  in  the  set.  For  example,  at  the 
bottom  "14(98)"  means  that  98  segment  labels  are  given  as  input  to  Noah  for  each 
segment  (seven  segments  are  represented  by  boxes)  and  on  the  average  the  ratings  for 
the  labels  are  such  that  the  confusion  is  14,  i.e.,  the  equivalent  number  of  equally  rated 
labels  is  14.  A brief  description  of  the  recognition  step  is  given  on  the  left  of  the 
figure  and  a symbolic  representation  is  on  the  right.  For  example,  on  the  average  there 
are  13  segment-labels  from  the  segmenter-labeler  which  pass  a threshold  (i.e.,  they 
have  a good  enough  rating  according  to  a threshold  criterion).  Their  ratings  are  such 
that  they  have  an  average  confusion  of  6. 

One task of the recognition algorithm is to reduce the confusion of the segment
labels by applying the constraints of its description of words. We see from Figure 5.4
that the segment label confusion is reduced (or remains the same) in every step except
for three. These three are: a) Step 2: joining the labels according to the sylpart syntax,
b) Step 4: interpreting the segment patterns (making sylpart hypotheses based on the
patterns), and c) Step 8: joining the onset, vowel, and coda hypotheses to form syllables.
In Step 2, the sylpart syntactical constraints do in fact reduce the confusion from what
it would have been if every label of a segment could be joined with every label of
adjacent segments to form patterns. In Section 2.4, we saw that the confusion of a set
of hypotheses at one level formed from all combinations of the hypotheses in two sets
of adjacent hypotheses at a lower level equalled the product of the confusions for the
two sets. Generalizing this to more than two sets, we can estimate the confusion of a set
of hypotheses at one level based on the confusion of the sets at a lower level when no


Level      Step                                            Onset      Vowel      Coda

Segment    (input segment labels)                                     14 (98)
           1.  Threshold                                              6 (13)
           2.  Join by syntax                              11 (21)    21 (49)    14 (26)
           3.  Length normalization and threshold          11 (18)    20 (45)    14 (23)
Sylpart    4.  Pattern to sylpart expansion                43 (67)    47 (104)   40 (63)
           5.  Check context and add weight penalty        25 (67)    41 (104)   24 (63)
           6.  Threshold                                   16 (27)    24 (44)    17 (29)
           7.  Delete duplicates                           12 (18)    11 (18)    13 (21)
Syllable   8.  Join by adjacency and syntax constraints               77 (207)
           9.  Threshold                                              62 (145)
           10. Delete duplicates                                      55 (125)
Word       11. Join by adjacency and ...                              21 (63)
           12. Length normalization and threshold                     16 (41)
           13. Delete duplicates                                      8 (20)

Figure 5.4: Propagation of Confusion During Recognition



syntactic constraints are applied. For example, the average number of segments for an
onset sylpart is 1.7. The expected average confusion for a set of competing onsets
when no syntactical constraints are applied is the measured confusion of a segment (6)
raised to the 1.7 power, or 21 (= $6^{1.7}$). Comparing this value to the actual average
confusion of 11 measured for an onset, we see that the learned syntactic constraints
reduced the confusion by about one-half. In the same way, the expected confusion for
Step 8 (for the syllables) is approximately 400 (= $12^{2.4}$). The adjacency and syntactic
constraints, used to put the onset, vowel, and coda hypotheses together, reduce this number
to 77.
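The onset estimate can be verified with a few lines of Python. The function name is an assumption; the inputs (a segment confusion of 6, an average onset length of 1.7 segments, and a measured onset confusion of 11) are the figures quoted in the text.

```python
def expected_confusion(unit_confusion, average_length):
    """Confusion expected when every combination is allowed.

    The confusion of a joined set is the product of the confusions of
    its parts, so `average_length` parts of equal confusion give
    unit_confusion ** average_length.
    """
    return unit_confusion ** average_length


expected_onset = expected_confusion(6, 1.7)  # roughly 21
measured_onset = 11
reduction = measured_onset / expected_onset  # roughly one-half
```

The ratio of measured to expected confusion is a convenient single number for how much the learned syntactic constraints contribute at a given step.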

The increase in the average confusion for Step 4 of the recognition algorithm is
easily explained. Since one segment pattern may represent several sylparts, an
increase in the number of competing hypotheses occurs when the algorithm makes
sylpart hypotheses from the recognized segment patterns. For example, the onsets "P"
and "T" share the segment pattern "[- PH]". When that pattern is recognized, the onset
"P" and the onset "T" are each hypothesized. The increase in competing hypotheses
causes a corresponding increase in the average confusion. The increased confusion
introduced by the multiple interpretations of segment patterns in this step is reduced in
part by the use of context and the weight penalty in the next step.

There  are  two  sources  of  confusion  for  the  algorithm.  The  first  is  from  the 
segment-label  input  to  the  algorithm;  the  second  is  from  the  multiple  interpretations  of 
the  segment  patterns,  as  seen  in  Step  4.  The  second  source  is  due  to  the  ambiguity  of 
speech  stored  in  the  segment-sylpart  trees  and  acquired  from  the  segmented  and 
labeled  training  utterances. 

The  final  result  of  the  recognition  algorithm  according  to  the  confusion  measure 
is  20  competing  word  hypotheses  per  utterance  word  with  ratings  such  that  the 
equivalent  number  of  equally  likely  word  hypotheses  is  8. 

We  have  used  the  confusion  measure  to  give  another  view  of  the  recognition 
algorithm.  Much  more  could  be  done  with  this  measure.  For  example,  it  could  be  used 
to  see  the  effect  of  adjusting  various  thresholds  throughout  the  algorithm;  it  could  be 
used  to  see  where  in  the  algorithm  larger  vocabularies  cause  greater  confusion;  and  it 
could  be  used  to  analyze  the  effect  of  training.  Time  did  not  permit  a more  thorough 
application  of  the  measure. 


Chapter 6: Results and Analysis


6.1  Introduction 

This  chapter  is  separated  into  three  main  sections:  performance  and  runtime 
characteristics,  analysis  of  performance,  and  a comparison  with  two  other  word 
hypothesizers.  Before  performance  is  discussed  we  need  to  describe  how  performance 
is  measured  and  the  conditions  under  which  Noah  was  trained  and  tested. 

The  difficulty  with  reporting  performance  for  a word  hypothesizer  is  that  the 
measure  of  performance  is  closely  connected  with  the  characteristics  of  the  speech 
system  in  which  the  hypothesizer  is  used.  For  example,  in  a speech  system  which  uses 
only  the  best  correct  bottom-up  word  hypothesis  to  begin  a top-down  (grammar 
restricted)  search  for  the  rest  of  the  words  of  the  utterance,  the  relevant  performance 
is  the  rating  of  the  best  correct  word  hypothesis  relative  to  the  ratings  of  all  other 
hypotheses.^1 The goal, in this case, is always to rate a correct word hypothesis better
than  all  other  hypotheses  and  to  minimize  the  number  of  other  hypotheses. 
Performance  has  been  measured  for  Noah  with  a more  bottom-up  speech  system  in 
mind.  The  goal  for  a word  hypothesizer  in  such  a system  is  to  hypothesize  all  of  the 
correct  words  and  no  others.  (If  such  a goal  were  reached,  the  rest  of  the  system 
would  have  little  to  do.)  Thus,  the  relevant  performance  is  given  by  1)  the  number  of 
words  hypothesized,  2)  the  number  of  correct  words  hypothesized,  and  3)  the  ratings  of 
the  correct  hypotheses  relative  to  the  ratings  of  the  incorrect  ones.  The  next  section 
describes  in  detail  what  measurements  of  performance  are  used. 

6.1.1 Measurements of Performance

6.1.1.1 Word Accuracy and Average Rank

The first question to answer is "What is a correct word hypothesis?" All test
utterances (105) were hand-segmented into words; that is, the words of the spoken
utterance were given begin and end segment numbers by inspecting the segment labels
and the speech waveform. (Of course, this hand-segmentation is used only for
performance analysis; Noah cannot access it while doing its hypothesization.) A word
hypothesis matching a hand-defined correct word in name and having a begin segment
within 2 segments of the "correct" begin segment and an end segment within 2
segments of the "correct" end segment is considered to be a correct word hypothesis.
The rating of a correct word hypothesis relative to the incorrect ones is measured by
computing the rank of the correct hypothesis. The rank of an hypothesis is defined to
be the number of competing hypotheses rated better than it, plus one-half the number
of competing hypotheses rated the same as it, plus one (i.e., a rank of one is the best
rank: no hypotheses rated better). Two hypotheses are considered to compete if the
amount of their overlap in time is greater than one-half the duration of the shorter
hypothesis. For example, in Figure 6.1 hypothesis A competes with hypotheses C, D,
and E (and vice-versa) but not with B. This definition is somewhat arbitrary and it
may result in calling two hypotheses "competing hypotheses" which a speech system
would not consider as competing. However, it is the definition used by POMOW and is
used here to permit a comparison of the word hypothesizers.

1 The performance of Noah will be measured by this method when it is compared to the
lexical retrieval component of the HWIM system, a system using the above strategy
for recognition.
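The definitions of competing hypotheses and rank can be sketched in Python as follows. The `Hyp` record and the begin times, end times, and ratings are hypothetical; they are chosen so that A competes with C, D, and E but not with B, as in the example above. Lower ratings are better.

```python
from typing import NamedTuple


class Hyp(NamedTuple):
    name: str
    begin: float
    end: float
    rating: float


def competes(a, b):
    """True if the overlap exceeds half the duration of the shorter one."""
    overlap = min(a.end, b.end) - max(a.begin, b.begin)
    shorter = min(a.end - a.begin, b.end - b.begin)
    return overlap > shorter / 2


def rank(target, hypotheses):
    """Competitors rated better, plus half of those rated the same, plus one."""
    better = same = 0
    for h in hypotheses:
        if h is target or not competes(target, h):
            continue
        if h.rating < target.rating:
            better += 1
        elif h.rating == target.rating:
            same += 1
    return better + same / 2 + 1


# Hypothetical hypotheses for illustration.
A = Hyp("A", 0, 10, 5)
B = Hyp("B", 9, 20, 1)   # overlaps A only slightly: does not compete
C = Hyp("C", 2, 8, 3)    # competes, rated better than A
D = Hyp("D", 0, 10, 5)   # competes, rated the same as A
E = Hyp("E", 4, 14, 9)   # competes, rated worse than A
rank_of_A = rank(A, [A, B, C, D, E])
```

Here the rank of A is 1 (for C) plus 0.5 (for D) plus 1, even though B has the best rating of all, because B does not compete with A.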


Figure 6.1: Competing and Noncompeting Hypotheses (hypotheses A through E plotted against time)


Added to the above tests for a correct hypothesis is the restriction that its rank
be less than or equal to 20. Though this threshold of 20 is somewhat arbitrary, it was
thought that no speech system could afford to look deeper than about 20 words for the
correct word in the bottom-up word hypotheses at any place in the utterance. This
restriction is similar to stating that no more than 20 words are hypothesized for each
utterance word. The performance of Noah can now be given by two numbers: the
percent of the utterance words correctly hypothesized and the average rank of the
correct word hypotheses. The goal is therefore 100% word accuracy at an average
rank of 1.


6.1.1.2 Average Efficiency

In an attempt to reduce the measure of performance to one number which could
be used to monitor the progress in developing the word hypothesizer and to compare it
across different vocabulary sizes, the measure of "average efficiency" was defined. The
average efficiency measures a weighted word accuracy of the hypothesizer by
weighting each correct word hypothesis according to its rank. The average efficiency
for a set of utterances is given by:

$$\text{Average Efficiency} = \frac{1}{n} \sum_{1 \le i \le n} \frac{1}{R_i}$$

expressed as a percent, where $R_i$ is the rank of the ith correct word hypothesis and n
is the total number of words in the test utterances. If the ith correct word is missed,
its rank is taken to be infinity. This equation is equivalent to computing a weighted
accuracy by counting all the correct hypotheses at rank 1, 1/2 of the correct
hypotheses at rank 2, ... and 1/k of the correct hypotheses at rank k. Thus, the
average efficiency varies from 0% for no correct hypotheses to 100% for a perfect word
hypothesizer. The term "efficiency" was chosen because the measure is the ratio of
work output (one correct hypothesis) to the work input (producing another hypothesis)
at each place in the utterance if the word hypothesizer is used as a word generator. A
word generator produces the next best-rated word hypothesis at any place in the
utterance as requested by the rest of the speech system. Averaging this ratio over a
set of utterances gives the "average efficiency".

Two criticisms can immediately be made about this measure. First, it says
nothing about the number of hypotheses made. This can be answered by remembering
that at no time do more than the top 20 ranks need be considered in searching for the
correct word. Second, the average efficiency is a harsh performance measure. It puts
a lot of importance on hypothesizing the correct word in the top rank, but whether 2
correct word hypotheses at rank 2, or 5 correct word hypotheses at rank 5 are better
than only 1 correct word hypothesis at rank 1 depends on the speech system using the
word hypothesizer and on the complexity of the language. In any case, the average
efficiency measure is insensitive to correct hypotheses in the higher ranks. This is not
true for the word accuracy or average rank measures. For example, if 70 out of 100
words are hypothesized correctly at an average rank of 3 and an average efficiency of
40% and then one more correct hypothesis is made at rank 20, the word accuracy
increases by 1% and the average rank by 0.24, but the average efficiency increases only
by 0.05%.
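The example can be worked through in Python. The particular distribution of ranks below (30 hypotheses at rank 1, 20 at rank 3, and 20 at rank 6) is a hypothetical one, chosen only because it yields 70 correct words out of 100 at an average rank of 3 and an average efficiency of 40%; the thesis does not give the underlying ranks.

```python
def average_efficiency(ranks, n_words):
    """Sum of 1/rank over correct hypotheses, divided by all utterance words.

    Missed words have infinite rank and contribute nothing, so they are
    simply left out of `ranks`.
    """
    return sum(1.0 / r for r in ranks) / n_words


def average_rank(ranks):
    return sum(ranks) / len(ranks)


N_WORDS = 100
ranks_before = [1] * 30 + [3] * 20 + [6] * 20  # hypothetical distribution
ranks_after = ranks_before + [20]              # one more correct word at rank 20

accuracy_change = (len(ranks_after) - len(ranks_before)) / N_WORDS
rank_change = average_rank(ranks_after) - average_rank(ranks_before)
efficiency_change = (
    average_efficiency(ranks_after, N_WORDS)
    - average_efficiency(ranks_before, N_WORDS)
)
```

The efficiency change depends only on the new hypothesis (1/20 divided by 100 words), whereas the change in average rank depends on the ranks already present.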




For completeness, we will also show the cumulative word accuracy at the best
5 ranks. That is, the word accuracy when only the first rank is considered, when only
the first two ranks are considered, and finally, when only the first five ranks are
considered.

6.1.1.3 Summary of Performance Measures


We  summarize  here  some  of  the  terms  and  performance  measures  given  above; 
the  performance  measures  are  underlined: 

Correct  Word  Hypothesis:  A word  hypothesis  which  matches  an  utterance 
word  in  name  and  in  position  in  the  utterance.  The  position 
matches  if  the  begin  and  end  segments  of  the  hypothesis  are 
each  within  two  segments  of  the  corresponding  begin  and  end 
segments  of  the  utterance  word. 

Competing  Hypotheses:  Two  hypotheses  whose  overlap  in  time  is  greater 
than  one-half  the  duration  of  the  shorter  hypothesis. 

The  Rank  of  an  Hypothesis:  The  number  of  competing  hypotheses  rated 
better  than  the  hypothesis  plus  one-half  the  number  rated  the 
same  plus  1. 

Word  Accuracy:  The  percent  of  utterance  words  for  the  test  data  having 
Correct  Word  Hypotheses  at  a rank  less  than  or  equal  to  20. 

Average  Rank:  The  average  rank  of  the  Correct  Word  Hypotheses  for  the 
test  data. 

Average Efficiency: A weighted Word Accuracy computed by summing
1/(The Rank of a Correct Word Hypothesis) for all Correct Word
Hypotheses and dividing by the total number of utterance words.

Word  Accuracy  for  the  best  M ranks:  The  Word  Accuracy  if  hypotheses 
are  limited  to  the  best  M ranks.  (Word  Accuracy  above  is  for  the 
best  20  ranks.) 

6.1.2  Training  and  Testing  Conditions 

Noah  was  trained  on  174  utterances  (about  1600  syllables)  that  had  been  hand- 
segmented  into  sylparts,  as  described  in  Chapter  4.  (This  set  was  repeatedly  divided  to 
obtain  the  smaller  training  sets).  A different  set  of  105  utterances  (705  words)  made 
up  the  test  sentences.  The  test  sentences  had  been  looked  at  only  to  hand-segment 
them  into  words  before  testing.  All  sentences  were  spoken  by  the  same  speaker  in  a 
quiet  room  with  a close-speaking  microphone  in  sessions  spread  out  over  a period  of 
several months. The word hypothesizer was adjusted for best performance using the
1000-word vocabulary^2 and then tested on the other size vocabularies without further
adjustment. A 500-word vocabulary was made from a subset of the 1000-word
vocabulary which still included the 268 words of the test sentences. The larger
vocabularies were formed by adding to the 1000-word vocabulary subsets (every Nth
word) of the 20,000-word dictionary. No word in this dictionary was included in the
test vocabularies if it had already appeared in the 1000-word vocabulary or if it
included sylparts which did not appear in the training utterances. Thus, in all tests,
each word in the vocabulary was unique and each had the potential of being
hypothesized.


6.2  Performance  and  Runtime  Characteristics 

This  section  contains  a description  of  Noah’s  performance;  an  analysis  of  that 
performance  is  given  in  the  following  section. 

6.2.1  Performance  versus  Word  Vocabulary  Size 

The  graphs  of  Figure  6.2  give  Noah’s  performance  by  three  measurements  (word 
accuracy,  average  correct  word  rank,  and  average  efficiency)  as  a function  of 
vocabulary size. Word accuracy is seen to drop from 73% to 58% as the vocabulary
increases  from  500  words  to  19,000  words.  At  the  same  time  the  average  rank  of  the 
correct  word  hypotheses  climbs  from  2.6  to  5.8  for  the  same  increase  in  vocabulary 
size.  The  efficiency  measure  shows  a smooth  decline  in  performance  which  is 
approximately  logarithmic  in  vocabulary  size.  (A  plot  of  a logarithmic  curve  is  also 
given). 

Figure 6.3 gives the word accuracy in the top 5 ranks. From these graphs it is
easy to see the effect on the accuracy of limiting Noah to hypothesizing only the best M
words at each point in the utterance (for M = 1, 2, ..., 5).

6.2.2  Performance  versus  Training  Sample  Size 

The  performance  of  Noah,  again  given  by  three  measures,  is  plotted  against  the 
total  number  of  sylpart  paths  learned  from  different  sizes  of  training  sets  in  Figure  6.4 
for the 1000 and the 8000-word vocabularies. The number of sylpart segment paths
rather  than  the  total  number  of  sylpart  samples  is  displayed  on  the  x-axis  to  show  the 
effect  that  new  information  has  on  the  performance.  For  both  vocabularies  the  word 
accuracy  as  well  as  the  average  rank  of  the  correct  hypotheses  is  seen  to  rise  as  the 


2 The "1000-word" vocabulary is the one used by the Hearsay II system. Actually this
vocabulary  is  made  up  of  1011  words.  The  other  vocabularies  consisted  of  508,  2013, 
4020,  8032,  16,025,  and  19,008  words. 


Figure 6.2: Performance of Noah versus Vocabulary Size
(Panels: Word Accuracy; Average Rank; Average Efficiency of Noah, with a plot of a
log curve. X-axis: Vocabulary Size (x1000).)

training  increases.  This  results  in  a gradual  increase  in  the  average  efficiency  of  Noah 
for  the  1000-word  vocabulary  but  no  increase  in  the  average  efficiency  for  the  8000- 
word  vocabulary  after  about  1080  sylpart  segment  paths  have  been  learned  (explained 
in  Section  6.3.2). 

6.2.3  Computation  Costs  versus  Vocabulary  Size 

The  Computation  Cost  given  as  millions  of  instructions  executed  per  second  of 
speech  (MIPSS)  is  plotted  in  Figure  6.5  as  a function  of  the  log  of  vocabulary  size  for 


6 Results  and  Analysis 


Figure 6.3: Accuracy in Top M Ranks versus Vocabulary Size
(X-axis: Vocabulary Size (x1000).)

different  subparts  of  the  word  hypothesizer.  The  total  cost  increases  at  a logarithmic 
rate  from  2.4  MIPSS  to  6.6  MIPSS  as  the  vocabulary  increases  from  500  words  to 
19,000  words.  (This  is  an  increase  of  about  0.75  MIPSS  per  doubling  of  the 
vocabulary.) The processing time as a multiple of real time (i.e., the computation time
divided by the time to speak the utterance) is shown by a scale to the right for the
machine used in these tests, a Digital Equipment Corporation PDP-KL10, which is about
a 1.3 million instruction per second machine. The effect of training on total computation
costs  is  small.  Over  the  range  of  training  sets  used,  the  change  in  computation  cost  was 
about  0.5  MIPSS. 
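
The cost figures quoted above can be checked with a short calculation. The sketch below assumes a purely logarithmic cost model drawn through the two endpoint values given in the text (2.4 MIPSS at 500 words, 6.6 MIPSS at 19,000 words) and a 1.3 MIPS machine. The function names are illustrative and are not part of Noah.

```python
import math

# Endpoint values quoted in the text.
V_SMALL, COST_SMALL = 500, 2.4      # vocabulary size (words), cost (MIPSS)
V_LARGE, COST_LARGE = 19_000, 6.6   # vocabulary size (words), cost (MIPSS)
MACHINE_MIPS = 1.3                  # approximate speed of the PDP-KL10


def mipss_per_doubling():
    """Slope of a log2 cost model passing through the two endpoints."""
    doublings = math.log2(V_LARGE / V_SMALL)
    return (COST_LARGE - COST_SMALL) / doublings


def estimated_mipss(vocabulary_size):
    """Interpolated cost under the assumed logarithmic model."""
    return COST_SMALL + mipss_per_doubling() * math.log2(vocabulary_size / V_SMALL)


def times_real_time(mipss):
    """Computation time divided by speech duration on the assumed machine."""
    return mipss / MACHINE_MIPS
```

Under these assumptions the slope comes out at roughly 0.8 MIPSS per doubling, in line with the figure of about 0.75 quoted above, and the largest vocabulary runs at roughly five times real time.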


6.2.4  Breakdown  of  Storage  Costs 

The Noah word hypothesizer (not including the segment pattern learning program or
the  dictionary  processing  program)  requires  51K  of  36-bit  words  of  storage  containing 
code  for  recognition,  debugging,  analysis  and  statistics,  and  runtime  constants  and 


Figure 6.4: Performance of Noah versus Training
(X-axis: Number of Sylpart Segment Paths.)


variables. A variable amount of storage is used for the tree structures containing the
knowledge of the system and for the structures containing the hypotheses at segment,
sylpart, syllable, and word levels. The size of this storage depends on the amount of
training, the size of the word vocabulary, and the length of the utterance. The
breakdown of storage costs for the hypothesizer working on a 51-segment utterance
(about 3 seconds) with a 1000-word vocabulary and the full training set used here (174
training utterances) is as follows:


Figure 6.5: Computation Cost versus Vocabulary Size
(X-axis: Vocabulary Size (x1000), log scale.)


                                         1000-word      19,000-word
                                         vocabulary     vocabulary

Code and fixed storage:                     51K             51K
Hypotheses for utterance:                   13K             13K
Segment-sylpart tree
  (training dependent):                     12K             12K
Sylpart-syllable and syllable-word
  tree (vocabulary dependent):              11K            148K

Total storage requirement:                  87K            224K

Note  that  the  storage  cost  for  the  vocabulary-dependent  knowledge  increases  to 
148K  (13  times  greater)  for  the  19,000-word  vocabulary.  It  would  be  quite  easy  to 
implement  a paging  algorithm  for  the  syllable-word  tree,  the  major  part  of  the  148K. 
(The  tree  was  implemented  with  a paging  algorithm  in  mind,  but  paging  was  not  found  to 
be  necessary  with  the  amount  of  primary  memory  available.)  Storage  costs  as  a 
function  of  training  and  vocabulary  size  have  been  given  in  Chapter  3. 


6.3  Analysis 

In  this  section  the  performance  is  investigated  from  several  different  angles. 
The  first  and,  to  a lesser  degree,  the  second  parts  address  the  issue  of  performance 
degradation  as  the  vocabulary  size  increases.  The  remaining  parts  of  the  analysis  look 
at  the  performance  before  it  degrades  — i.e.,  with  a 1000-word  vocabulary.  This 
analysis  will  permit  us  to  suggest  improvements  (given  in  the  last  chapter). 


Figure 6.6A: All Competing Hypotheses for "Please Help Me",
1000-Word Vocabulary
(Horizontal axis: Segment Number, 1 to 15. Vertical axis: Better Hypotheses, upward.)


6.3.1  Effect  of  Vocabulary  Size  on  Performance 

Performance  degrades,  as  expected,  as  more  words  compete  for  hypothesization 
in  the  utterance.  However,  by  one  measure  — the  average  efficiency  — the 
performance  degrades  only  by  a factor  proportional  to  the  log  of  the  vocabulary  size 
over  the  range  of  vocabularies  tested.  Figures  6.6A  and  6.6B  show  how  the 
performance  of  an  almost  perfectly  recognized  utterance  using  the  1000-word 
vocabulary  drops  when  the  16,000-word  vocabulary  is  used.  The  average  rank 
increases from 1.7 to 4.2, while the average efficiency drops from 78% to 70%.

In  most  cases,  for  this  example,  the  incorrect  competing  words  from  the  16,000- 
word  vocabulary  are  probably  those  which  a human  would  recognize  if  he  heard  only 
the corresponding part of the acoustics. What is missing from the word hypothesizer is
the  ability  to  recognize  syllable  and/or  word  boundaries.  (The  possibility  of  doing  this 
is  discussed  in  Chapter  7). 

Why  is  the  performance  degradation  as  measured  by  the  average  efficiency 
approximately  proportional  to  the  logarithm  of  the  vocabulary  size?  Though  we  can  not 


Figure 6.6B: The Top Competing Hypotheses for "Please Help Me",
16,000-Word Vocabulary
(Horizontal axis: Segment Number, 1 to 15. Vertical axis: Better Hypotheses, upward.)


answer  this  question  completely,  we  can  suggest  some  reasons.  However,  we  must 
caution  that  the  average  efficiency  is  only  one  measure  of  performance;  some  other 
measure  might  give  a different  rate  of  performance  degradation.  In  particular,  it  is  not 
known  how  a speech  system’s  performance  might  react  to  a logarithmic  degradation  in 
the  average  efficiency  of  a word  hypothesizer  for  the  range  of  vocabularies  tested. 




The first reason that comes to mind for the logarithmic degradation is that the
Noah  word  hypothesizer,  as  measured  by  the  average  efficiency,  is  simply  reacting  to  a 
characteristic  of  English  words.  Consider  a "word-sound  similarity"  space.  The  average 
efficiency  performance  degradation  would  be  logarithmic  if  the  density  of  the  words  in 
the  space  increases  a constant  amount  every  time  the  number  of  words  in  the  space 
doubles.  The  average  efficiency  measure  is  mainly  due  to  the  word  accuracy  for  the 
first rank position (about 80% of it for the 500-word vocabulary to about 50% for the
19,000-word vocabulary). The curve of the first rank word accuracy is also
approximately logarithmic in vocabulary size, decreasing about 5% for every doubling of
the vocabulary size (see Figure 6.3). This means that about 35 (≈5% of the 705 test
words)  of  the  correct  word  hypotheses  drop  out  of  the  first  rank  position  for  every 
doubling  of  the  vocabulary  size.  Thus,  only  a constant  and  small  number  of  words  are 
similar  enough  to  the  first  rank  words  to  cause  confusion,  even  though  the  total  number 
of  words  doubles  each  time.  Of  course,  for  the  more  poorly  ranked  word  hypotheses 
which  do  not  match  the  segment  labels  as  well,  more  words  would  be  found  in  the  larger 
vocabularies  to  cause  confusion.  However,  the  average  efficiency  measure  is  less 
sensitive  to  increases  in  the  rank  of  these  more  poorly  ranked  correct  word 
hypotheses. 
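
The arithmetic behind this argument can be written out as a small sketch. It assumes the approximate figures quoted above (a loss of about 5% of first-rank accuracy per doubling and 705 test words); the constant and function names are illustrative.

```python
import math

TEST_WORDS = 705            # number of test words, from the text
DROP_PER_DOUBLING = 0.05    # approximate first-rank accuracy lost per doubling


def words_lost_per_doubling():
    """Correct hypotheses leaving the first rank each time the vocabulary doubles."""
    return DROP_PER_DOUBLING * TEST_WORDS


def first_rank_drop(v_from, v_to):
    """Total first-rank accuracy lost between two vocabulary sizes (assumed log model)."""
    return DROP_PER_DOUBLING * math.log2(v_to / v_from)
```

The point of the model is that the loss per doubling stays constant (about 35 words) even though each doubling adds far more new words than the one before it.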

It  seems  that  for  the  vocabulary  sizes  tested,  words  in  the  "word-sound 
similarity"  space  are  quite  spread  out.  This  is  not  entirely  due  to  the  multisyllabic  word 
differences;  much  of  it  is  due  to  different  syllables.  We  found  that  the  first  1000  words 
put in the knowledge representation added about one new syllable for each word. The
last  1000  words  added  for  the  19,000  word  vocabulary  added  about  one  new  syllable 
for  every  nine  words.  The  rate  of  adding  new  syllables  decreased  as  expected,  but 
many  new  syllables  were  still  found  in  the  last  1000  words. 

The  problem  with  the  above  reasoning  is  that  this  "word-sound  similarity"  space 
is  defined  by  the  knowledge  and  knowledge  representation  of  the  Noah  word 
hypothesizer;  this  knowledge  at  the  lowest  level  is  learned  from  a limited  set  of  training 
samples. So on a less positive note, we must consider the possibility that the excellent
logarithmic degradation as the vocabulary size is increased arises because the training and
test utterances were both based on the 1000-word vocabulary. Although every sylpart
in the 19,000-word vocabulary had training samples, one might suspect that the
number  of  training  samples  was  biased  towards  those  sylparts  appearing  in  the  1000- 
word  vocabulary,  or  in  particular,  toward  those  sylparts  appearing  in  the  268  words  of 
the  test  utterances.  This  is  partly  true.  However,  the  bias  in  number  of  training 
samples  is  not  very  great:  using  the  number  of  training  samples  for  the  least-trained 
sylpart  of  each  word  as  a measure  of  the  amount  of  training  for  the  sylparts  of  a word, 
the average number of sylpart training samples for the words in the 268-word
vocabulary (the words of the test utterances), the 1000-word vocabulary, and the
16,000-word vocabulary was 25, 19, and 17, respectively. We suspect that a more
prevalent  bias  is  due  to  the  context  in  which  the  sylparts  were  learned.  Section  6.3.5 
discusses how word accuracy is affected by training on the words of the test sentences
as  opposed  to  training  on  the  sylparts  of  the  words.  Including  a test  word  in  the 
training  guarantees  that  its  sylparts  will  be  learned  in  the  appropriate  context. 

At  any  rate,  we  believe  that  any  bias  in  the  training  affects  only  the  rate  of  the 
logarithmic  performance  degradation,  not  the  fact  that  it  is  logarithmic.  The  average 
efficiency  curve  is  seen  to  be  approximately  logarithmic  over  the  range  of  vocabulary 
sizes from 2000 words to 19,000 words. These words were randomly chosen from one
20,000-word dictionary so that each vocabulary has the same (possible) training bias.

We  attribute  the  logarithmic  increase  in  the  computation  cost  with  the 
vocabulary  size  to  the  tree  searching  done  by  the  recognition  algorithm.  The  cost  of 
searching in trees storing information like that found in the sylpart-syllable and
syllable-word trees is logarithmic in the number of terminal nodes. (See [Knuth - 1973],
Vol. 3, p. 499.) We believe the rate of this logarithmic increase has been reduced by
using  two  levels  of  trees  to  store  the  dictionary  knowledge.  (The  segment-sylpart  tree 
is  independent  of  the  vocabulary  size,  as  is  indicated  by  the  constant  computation  cost 
shown  for  this  tree  in  Figure  6.5.) 
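
As a rough illustration of the two-level arrangement, the sketch below stores a toy dictionary in two small tries: one from sylpart sequences to syllables and one from syllable sequences to words. It is not Noah's actual data structure; the class names, the phoneme-like sylpart symbols, and the three example words are all assumptions made for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next key -> TrieNode
        self.entries = []    # items whose key sequence ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, keys, entry):
        node = self.root
        for key in keys:
            node = node.children.setdefault(key, TrieNode())
        node.entries.append(entry)

    def lookup(self, keys):
        """Follow one branch per key; the cost depends on key length, not dictionary size."""
        node = self.root
        for key in keys:
            node = node.children.get(key)
            if node is None:
                return []
        return list(node.entries)


# Level 1: (onset, vowel, coda) -> syllable.  Level 2: syllable sequence -> word.
sylpart_to_syllable = Trie()
syllable_to_word = Trie()

sylpart_to_syllable.insert(("h", "eh", "lp"), "HELP")
sylpart_to_syllable.insert(("m", "ey", "t"), "MATE")
sylpart_to_syllable.insert(("m", "iy", ""), "ME")
syllable_to_word.insert(("HELP",), "help")
syllable_to_word.insert(("HELP", "MATE"), "helpmate")
syllable_to_word.insert(("ME",), "me")


def words_for(sylpart_groups):
    """Map each sylpart group to a syllable, then the syllable sequence to words."""
    syllables = []
    for group in sylpart_groups:
        found = sylpart_to_syllable.lookup(group)
        if not found:
            return []
        syllables.append(found[0])
    return syllable_to_word.lookup(tuple(syllables))
```

Because syllables are shared among many words, adding words to the second trie mostly adds entries under existing syllable branches, which is one way to picture why the cost grows slowly with vocabulary size.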

6.3.2  Effect  of  Training  on  Performance 

Figure 6.4 has shown that more training of the type used here is not the answer
to  better  performance  for  larger  vocabularies.  Let  us  consider  each  sylpart  segment 
pattern  to  be  a point  in  some  D-dimensional  space  with  some  distance  metric.  We  can 
view  the  hypothesizer  as  if  it  used  the  nearest-neighbor  algorithm;  i.e.,  its  job  is  to  find 
the  closest  labeled  point  (corresponding  to  a learned  segment  pattern)  to  a new 
unknown  point  (corresponding  to  a segment  pattern  in  the  test  sentences)  in  order  to 
identify  the  new  point.  By  training  on  more  utterances,  new  labeled  points  are  added  to 
the  space  for  labels  which  have  little  or  no  training,  but  at  the  same  time  many  points 
are added to the space for frequently occurring labels. As bad samples occur, the
volume covered by the points of these labels increases, causing more and more
confusion during recognition; the possibility for more confusion increases with
vocabulary  size.  This  explains  why  the  performance  of  Noah,  according  to  the  average 
efficiency  measure,  reaches  a maximum  with  less  training  for  the  larger  8000-word 
vocabulary. 
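
The nearest-neighbor view can be made concrete with a toy example. The sketch below is only an analogy for the argument above, not Noah's matching procedure: the two-dimensional points, the labels "A" and "B", and the squared Euclidean distance are all assumptions chosen to show how one bad sample of a frequent label can capture a test point that was previously identified correctly.

```python
def nearest_label(labeled_points, query):
    """Return the label of the stored point closest to the query point."""
    def squared_distance(point):
        return sum((a - b) ** 2 for a, b in zip(point, query))

    _, label = min(labeled_points, key=lambda item: squared_distance(item[0]))
    return label


# Learned patterns: two samples of a frequent label "A", one sample of label "B".
training = [((0.0, 0.0), "A"), ((1.0, 0.0), "A"), ((4.0, 0.0), "B")]
query = (3.0, 0.5)  # an unknown pattern that really belongs to "B"

before = nearest_label(training, query)

# More training adds another "A" sample, this one badly segmented or mislabeled,
# which enlarges the region of the space claimed by "A".
training_more = training + [((3.0, 0.0), "A")]
after = nearest_label(training_more, query)
```

Here `before` is "B" and `after` is "A": the extra sample has made the frequent label cover more of the space and the test point is now confused with it.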

Two  features  of  Noah  are  designed  to  counteract  the  above  problem:  pattern 
merging  (discussed  in  Section  4.3.2)  and  the  weight  penalty  (discussed  in  Section  5.3). 
It is possible that these heuristics need to be adjusted for larger vocabularies. A third
solution for this problem is selective training; that is, learning additional segment
patterns  only  for  the  sylparts  for  which  errors  occur.  This  can  be  done  either  by 
recording,  hand-segmenting,  and  training  on  phrases  containing  the  sylparts  needing 
training  or  by  automatic  learning.  We  will  discuss  the  potential  for  automatic  learning  in 
Chapter  7. 

The anomaly in the average rank plots (the drop for the final data points)
can be explained perhaps by something like selective training. For the 8000-word
vocabulary, the learning of 200 more sylpart segment paths (from about 1710 to 1910
segment paths) resulted in an improvement of 0.3 in the average rank. About 20% of
this drop (0.06) is due to 21 new correct word hypotheses at an average rank of 6.4
and a loss of 7 correct word hypotheses at an average rank of 14. The remaining 80%
(0.24) is due to an overall shift in the rank of the word hypotheses which were correct
for both training sets. The 20 utterances³ containing these 200 new segment paths
contain  a higher  percentage  of  the  words  used  in  the  testing  utterances  than  the  other 
training  utterances. 

6.3.3  Error  Analysis  for  Sylpart  Recognition 

In a study of 215 syllables (from 20 utterances), accuracies for vowel, onset,
and coda recognition were found to be 79%, 91%, and 90% respectively (see Figure 5.3).
In 155 of these syllables (72%), all sylparts were recognized. In the remaining 60
syllables,  errors  were  due  to: 

Errors                                     Number     Part of Total

No syllable nucleus found:                    5            8%
Vowel-sequence not recognized:               17           28%
Vowel (and perhaps onset
  or coda) not recognized:                   25           42%
Vowel recognized but onset
  and/or coda not recognized:                13           22%

Missing  a sylpart  always  results  in  a missed  word.  Thus,  the  most  likely  cause 
(70% of the time) of not recognizing a word is that one of its vowels (appearing alone or
as  part  of  a vowel  sequence)  was  not  recognized.  We  expect  that  the  errors  due  to 
vowel  recognition  could  be  cut  in  half  by  more  careful  training  and  using  more 
information  in  the  recognition  algorithm  (such  as  syllable  stress),  as  suggested  in  Section 
7.4.1. However, the variability found in the vowel pronunciations due to prosodics,
context,  carelessness  of  the  speaker,  etc.,  will  always  make  vowels  difficult  to  recognize 
bottom-up. 
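
The percentages in this breakdown follow directly from the counts. The short sketch below recomputes them from the numbers given above; the dictionary keys are paraphrases of the error categories and the variable names are illustrative.

```python
TOTAL_SYLLABLES = 215

# Counts of syllables with errors, by cause, as listed in the text.
errors = {
    "no syllable nucleus found": 5,
    "vowel sequence not recognized": 17,
    "vowel (and perhaps onset or coda) not recognized": 25,
    "vowel recognized, onset and/or coda not recognized": 13,
}

missed = sum(errors.values())
shares = {cause: round(100 * count / missed) for cause, count in errors.items()}

# Share of errors caused by a vowel, alone or inside a vowel sequence.
vowel_share = (errors["vowel sequence not recognized"]
               + errors["vowel (and perhaps onset or coda) not recognized"]) / missed

# Share of syllables in which every sylpart was recognized.
all_recognized = (TOTAL_SYLLABLES - missed) / TOTAL_SYLLABLES
```

The two vowel-related categories together make up 42 of the 60 failures, which is the 70% figure quoted above.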


3 Set  LLA;  see  Appendix  D for  a list  of  the  training  and  test  utterances. 




Syllable nuclei were usually missed for syllabic /N/’s, as found in the final
syllable  of  "written"  and  the  final  syllable  of  "hasn’t".  It  would  be  quite  easy  to  add  to 
the  segmenter-labeler  tests  for  correctly  recognizing  these  kinds  of  syllable  nuclei. 


6.3.4  Effect  of  Word  Length  on  Word  Accuracy 

For  the  1000-word  vocabulary,  the  following  was  observed  for  words  with 
different  numbers  of  syllables: 


Word length in syllables:         1             2            3           ≥4

# words (% of total):          366 (52%)    216 (30%)    85 (12%)    38 ( 5%)
# recognized (% of column):    301 (82%)    155 (71%)    48 (56%)    13 (34%)
# words missed (% of tot.):     65 ( 9%)     61 ( 8%)    37 ( 5%)    15 ( 2%)
# hypothesized
    (% of total):            12111 (88%)   1430 (10%)   160 ( 1%)    22 ( 0%)
Prob. hypothesized word
    is correct for length:       .02          .11          .30         .59


Although  the  longer  words  contribute  a smaller  part  of  the  total  error,  they  are 
more  apt  to  be  correct  (e.g.,  13  out  of  22  (59%)  of  the  four  or  more  syllable  word 
hypotheses  are  correct).  Also,  long  words  are  more  often  strong  content  words;  that  is, 
they  carry  the  most  important  information  of  the  sentence.  The  short  monosyllabic 
words  include  "a",  "the",  "an",  "of",  etc.,  which  do  not  carry  much  of  the  meaning  of  a 
sentence.  Thus,  it  is  clear  that  more  emphasis  on  hypothesizing  longer  words  would 
pay  off. 
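
The last row of the table above is simply the number of words recognized divided by the number of hypotheses made, for each word length. A minimal sketch of that calculation, using the counts from the table (the dictionary keys are illustrative labels):

```python
# Counts by word length in syllables, taken from the table above.
recognized = {"1": 301, "2": 155, "3": 48, "4+": 13}
hypothesized = {"1": 12111, "2": 1430, "3": 160, "4+": 22}

# Probability that a hypothesized word of a given length is correct.
precision = {length: recognized[length] / hypothesized[length]
             for length in recognized}

total_hypotheses = sum(hypothesized.values())
```

The contrast is large: a monosyllabic hypothesis is correct about one time in forty, while a hypothesis of four or more syllables is correct more often than not.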

6.3.5  Word  Training  versus  Sylpart  Training 

One  of  the  features  of  Noah  which  makes  large  vocabularies  possible  and  at  the 
same  time  permits  changing  the  segmenter-labeler  with  a minimum  of  effort  is  the 
learning  of  segment  patterns  at  the  sylpart  level  rather  than  some  higher  level  (such  as 
the  word  level).  The  claim  was  that  by  properly  handling  coarticulation  effects  at  the 
sylpart level, Noah could recognize any word in its vocabulary, even if the word had not
appeared in its training. When it was noticed that the word accuracy for words not
occurring in the training sentences was only 50% compared with an overall accuracy of
73% (using the 1000-word vocabulary), this claim came into question. However, the
amount  of  sylpart  training  for  these  words  must  be  studied  before  judgment  is  passed. 

Figure 6.7 shows the accuracy for the words of the test sentences grouped in
columns  by  the  number  of  occurrences  in  the  training  sentences  and  grouped  in  rows 
by  number  of  sylpart  training  samples  for  the  sylpart  of  each  word  having  the  minimum 
amount  of  training.  For  each  word,  the  sylpart  with  the  minimum  number  of  training 
samples  is  thought  to  be  the  most  likely  sylpart  missed  when  attempting  to  recognize 
that  word.  The  range  for  this  number  is  on  the  left-hand  side  of  the  table.  The  groups 


                       Number of Occurrences of Word in Training

Number of
Occurrences of
Least-Trained
Sylpart of Word
in Training         0         1→3        4→10       11→19       ≥20         ≥0

≤4                21/55      16/38       5/10        0/0        0/0       42/103
                   38%        42%        50%         --         --         41%

5→14              19/40      24/43      27/34        7/8        0/0       77/125
                   48%        56%        79%        88%         --         62%

15→21              4/13      12/24       9/12       30/35      17/19      72/103
                   31%        50%        75%        88%        89%         70%

22→35             23/34      24/29      24/28       12/15       0/0       83/106
                   68%        83%        86%        80%         --         78%

≥36               12/15      19/22      60/66       80/86      72/79     243/268
                   80%        86%        91%        93%        91%         91%

≥0                79/157     95/156    125/150     129/144     89/98     517/705
                   50%        61%        83%        90%        91%         73%

Figure 6.7: Word Accuracy as a Function of Word Training and Sylpart Training

were  chosen  to  give  about  the  same  number  of  samples  along  each  dimension.  This 
table  seems  to  indicate  that  both  factors,  the  amount  of  sylpart  training,  and  the 
frequency  of  word  training,  contribute  independently  to  word  accuracy.  Testing  on  the 
same  words  used  in  training  guarantees  that  the  sylparts  are  learned  in  the  appropriate 
context.  Thus,  training  on  the  same  word  generally  gives  better  results  than  just 
training on the sylparts of the word. However, word accuracy is high for those words
not  appearing  in  training  but  having  a large  number  of  training  samples  for  their 
sylparts. 
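
The marginal accuracies in Figure 6.7 can be recomputed from the cell counts, which is a useful check on the claim that both factors matter. The sketch below assumes the cell values as read from the figure (correct/total per cell, rows ordered from least to most sylpart training, columns from least to most word training); the variable names are illustrative.

```python
# (correct, total) per cell of Figure 6.7.
cells = [
    [(21, 55), (16, 38), (5, 10),  (0, 0),   (0, 0)],
    [(19, 40), (24, 43), (27, 34), (7, 8),   (0, 0)],
    [(4, 13),  (12, 24), (9, 12),  (30, 35), (17, 19)],
    [(23, 34), (24, 29), (24, 28), (12, 15), (0, 0)],
    [(12, 15), (19, 22), (60, 66), (80, 86), (72, 79)],
]


def accuracy(pairs):
    """Pooled accuracy over a list of (correct, total) pairs; None if empty."""
    correct = sum(c for c, _ in pairs)
    total = sum(t for _, t in pairs)
    return correct / total if total else None


row_accuracy = [accuracy(row) for row in cells]              # by sylpart training
column_accuracy = [accuracy(col) for col in zip(*cells)]     # by word training
overall = accuracy([cell for row in cells for cell in row])

# Words never seen in training whose least-trained sylpart is well trained.
unseen_but_well_trained = accuracy([cells[-1][0]])
```

Both sets of marginals rise steadily, and the last quantity is 12/15 (80%), which is the basis for the remark that words absent from training are still recognized well when their sylparts are heavily trained.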

6.3.6 What Words Should be Hypothesized?

It  is  generally  thought  that  certain  words  should  not  be  hypothesized  by  a 
bottom-up  word  hypothesizer.  Usually  these  words  are  the  "small  function"  words  of  a 
sentence  such  as  "the",  "and",  "a",  and  "of",  which  do  not  give  much  clue  to  the  content 
of the sentence and are often hypothesized incorrectly since they often appear as
subsets of other words. In testing Noah, all words were permitted to be hypothesized.
At this point we can look at how the performance changes if certain words are
eliminated from the hypothesizer’s vocabulary. This will be done without regard to the
"content-value"  of  a word  (i.e.,  how  much  the  correct  hypothesization  of  the  word  would 
constrain  the  rest  of  the  words  in  the  sentence). 

Obviously,  the  current  performance  could  be  improved  trivially  by  a post  hoc 
elimination  of  all  words  that  were  not  hypothesized  correctly  anywhere  in  the  test 
utterances.  Instead,  considering  only  words  which  were  hypothesized  correctly  at  least 
once  (177  words  out  of  a possible  258),  Table  6.1  shows  the  15  worst  words  when  the 
words are ordered by the number of times a word is hypothesized correctly divided by
the  number  of  times  it  is  hypothesized. 


           # in     # times     # times          Avg. Rank      # times better
Word       Test     Correct     Hypothesized     of Correct     than Correct

ED           1         1            250             18.6             12
UP           2         1            238              4.6             23
AN           4         4            616              2.5             65
RAT          1         1            127             12.6              9
IT           3         3            422              1.7             23
A            5         3            328              2.7             15
1*0          1         1             92              1.8              7
OR           1         1            186              3.8              4
DATE         1         1             86              5.8              4
AND          3         3            235              5.7             17
DID          3         2            148              4.8              3
KEN          1         1             67              1.8              3
LEE          1         1             61              5.8              3
ONE          1         1             62              3.8              1
BRV          1         1             58              8.8              1

Table 6.1: Worst 15 Words Ordered by
(# times Correct) / (# times Hypothesized)


Similarly,  Table  6.2  shows  the  15  worst  words  when  ordered  by  the  number  of 
times  the  word  was  incorrectly  hypothesized  better  than  the  competing  correct 
hypothesis. If the top 6 words of Table 6.1 are eliminated, 14% of the total hypotheses
(13723) of the 105 test utterances would be eliminated but only 1.8% of the correct
hypotheses. If the top 5 words of Table 6.2 are eliminated, the average rank drops
from 2.9 to 2.4. However, these words occur frequently in the test utterances so that
eliminating them reduces the word accuracy from 73% to 65%.
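
These trade-offs can be reproduced from the table entries. The sketch below uses the counts as read from Tables 6.1 and 6.2 (several digits in the scanned tables are hard to read, and the word written here as ANY is an uncertain reading), together with the totals of 13723 hypotheses, 705 test words, and 517 correctly hypothesized words. Taking the 1.8% figure relative to the 705 test words is an assumption that happens to reproduce the quoted value.

```python
TOTAL_HYPOTHESES = 13723
TEST_WORDS = 705
CORRECT_WORDS = 517

# Table 6.1, six worst words: word -> (times correct, times hypothesized).
table_6_1_top6 = {
    "ED": (1, 250), "UP": (1, 238), "AN": (4, 616),
    "RAT": (1, 127), "IT": (3, 422), "A": (3, 328),
}

# Table 6.2, five worst words: word -> times correct ("ANY" is an uncertain reading).
table_6_2_top5 = {"AN": 4, "IN": 9, "OF": 14, "ANY": 21, "IS": 8}

hypotheses_removed = sum(hyp for _, hyp in table_6_1_top6.values())
correct_removed = sum(cor for cor, _ in table_6_1_top6.values())

fraction_hypotheses_removed = hypotheses_removed / TOTAL_HYPOTHESES
fraction_correct_removed = correct_removed / TEST_WORDS

# Word accuracy if the five words of Table 6.2 can no longer be hypothesized.
accuracy_after = (CORRECT_WORDS - sum(table_6_2_top5.values())) / TEST_WORDS
```

The first elimination removes about a seventh of all hypotheses at the cost of 13 correct ones; the second gives up 56 correct hypotheses, which accounts for the fall in word accuracy.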

Clearly,  it  is  the  words  with  very  little  acoustic  constraint  which  cause  most  of 
the incorrect hypotheses. We found that 1% of the 1000-word vocabulary (the ten
words: "an", "at", "in", "it", "the", "of", "a", "to", "done", and "Ann") accounts for almost
30% of the incorrect hypotheses. Vocabulary words can be divided into four (fuzzy)
groups  as  follows: 


           # in     # times     # times          Avg. Rank      # times better
Word       Test     Correct     Hypothesized     of Correct     than Correct

AN           5         4            616              2.5             65
IN           9         9            436              2.6             37
OF          14        14            399              3.5             34
ANY         23        21            102              1.3             26
IS           9         8            229              3.1             24
UP           2         1            239              4.8             23
IT           3         3            422              1.7             23
THE         18        15            405              3.3             19
AND          3         3            235              5.7             17
DOES         5         4            124              2.5             16
A            5         3            328              2.7             15
ARE         19        17            189              4.5             13
TO          13        12            315              3.5             12
I           11        11            245              3.5             12
ED           1         1            250             10.8             12

Table 6.2: Worst 15 Words Ordered by
Number of Times Rated Better than the Correct Word


                                        Acoustic Constraints
                                        Weak            Strong

Syntactic/Semantic        Weak          I. ("An")       II. ("Also")
"Content-Value"           Strong        III. ("Ed")     IV. ("Abstracts")

An  example  of  a member  of  each  group  is  given  in  parentheses.  A word 
hypothesizer  has  no  trouble  with  hypothesizing  words  from  Groups  II  and  IV;  the  strong 
acoustic  constraints  reduce  the  number  of  misses  for  these  words.  The  problem  is  with 
the words of Groups I and III. As shown in the figures above, these words are often
hypothesized incorrectly. We suggest that the word hypothesizer use variable
acceptance thresholds for words in these groups so that the words in Group I are
hypothesized only if they have "very good" ratings and words of Group III are
hypothesized only if they have "fairly good" ratings. Of course, "very good" and "fairly
good" would have to be defined. The acceptance threshold for each word could be
based  on  the  product  of  a measure  of  the  content-value  of  the  word  and  the  acoustic 
constraint  of  the  word;  Group  I having  the  strictest  acceptance  threshold  and  Group  IV 
having  the  most  lenient  acceptance  threshold. 
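
One possible form of such a threshold is sketched below. Everything in it is an assumption made for illustration: ratings and both measures are taken to lie between 0 and 1 with higher values meaning better, the threshold is interpolated linearly on the product of the two measures, and the endpoint values 0.9 and 0.5 are arbitrary stand-ins for "very good" and a lenient cutoff.

```python
def acceptance_threshold(content_value, acoustic_constraint,
                         strictest=0.9, most_lenient=0.5):
    """Minimum rating a word hypothesis must reach before it is emitted.

    content_value and acoustic_constraint are assumed to be in [0, 1].
    A small product (Group I) gives a threshold near `strictest`;
    a large product (Group IV) gives a threshold near `most_lenient`.
    """
    product = content_value * acoustic_constraint
    return strictest - (strictest - most_lenient) * product


def accept(rating, content_value, acoustic_constraint):
    """Emit the hypothesis only if its rating meets the word's threshold."""
    return rating >= acceptance_threshold(content_value, acoustic_constraint)


# Hypothetical measures for one word from each of three groups.
group_i = acceptance_threshold(0.1, 0.1)     # weak content, weak acoustics
group_iii = acceptance_threshold(0.9, 0.1)   # strong content, weak acoustics
group_iv = acceptance_threshold(0.9, 0.9)    # strong content, strong acoustics
```

With these made-up numbers a hypothesis rated 0.7 would be accepted for a Group IV word but rejected for a Group I word, which is the behavior the suggestion above is after.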

A speech  system  which  searches  the  word  hypotheses  to  find  syntactically  legal 
sequences  of  words  should  assume  the  existence  of  a Group  I word  any  time  it  needs 
one  to  extend  a sequence.  Later,  these  words  can  be  verified  for  the  best  sequences  of 
words. 

The  important  thing  to  note  is  that  the  word  hypothesizer  can  be  controlled  by 
preset thresholds or dynamically by a speech system to hypothesize those words most
useful to the task of understanding speech.


6.4  Performance  Comparison  with  other  Word  Hypothesizers 

6.4.1  POMOW-Wizard 

As  described  in  Section  1.3,  the  POMOW  word  hypothesizer  of  Hearsay-II  [Smith 
- 1976]  passes  an  average  of  90  word  hypotheses  for  each  utterance  word  to  the 
Wizard word verifier [McKeown - 1977]⁴ to be rated. We will treat the combination of
these  two  knowledge  sources  as  a single  word  hypothesizer  for  a comparison  with 
Noah. 


                                          POMOW-Wizard       Noah

Performance:
  Word Accuracy:                              64%             73%
  Avg. Number of Word Hypotheses
    per Utterance Word:                       72              20
  Average Rank of Correct Words:              5.6             2.9

Size (K of 36-bit words):
  Storage for 1000 Words:                     37K             11K
  Program and other Storage:                  46K             76K
  Total:                                      83K             87K

Computation Costs:
  Number of Million Instructions
    per Second of Speech (MIPSS):             28              3.1
  Times real-time for a PDP-KL10:             22              2.7

Table 6.3: Comparison of POMOW-Wizard and Noah
for the 1000-Word Vocabulary


Table 6.3 compares the word hypothesizers on three dimensions. Noah, using
about the same amount of space and about one-tenth the computation time, performs
much better than POMOW-Wizard.⁵


4 Wizard  is,  in  effect,  a miniature  Harpy  system  for  individual  words,  rather  than 
complete  utterances.  Its  knowledge  is  the  1000-word  segment-label  network 
dictionary  of  the  Harpy  system,  which  was  built  manually  by  looking  at  samples  of 
segmented  and  labeled  speech. 

5 The  size  comparison  for  the  program  parts  of  the  hypothesizers  is  a little  uncertain. 
Both  hypothesizers  include  debugging  and  analysis  code.  The  code  for  Wizard  (13K 




The  improvement  in  speed  for  Noah  is  attributed  to  a)  its  ability  to  eliminate 
groups  of  poorly  matching  word  hypotheses  by  rejecting  hypotheses  at  the  low  levels 
and  b)  the  efficiency  of  a tree  search  compared  to  the  various  search  strategies  found 
in  POMOW-Wizard.  One-third  of  POMOW-Wizard’s  time  is  spent  in  POMOW  generating 
the  90  word  hypotheses  per  utterance  word;  two-thirds  of  the  time  is  spent  by  Wizard 
to  verify  them. 

We  attribute  the  improvement  in  performance  of  Noah  to  its  greater  amounts  of 
speech and segmenter-labeler specific knowledge. The lower word accuracy of 64% for
POMOW-Wizard (as opposed to 73% for Noah) is due mainly to POMOW (Wizard rejects
only 2% of POMOW's correct word hypotheses). Its poorer average rank of 5.6 (as
opposed to 2.9 for Noah) is due to Wizard. As described in Section 1.3.1, POMOW's
knowledge is limited to what can be learned from the segment labels for seven
equivalence classes of phonemes and stored in a Markov probability model. Wizard's
knowledge is obtained manually by looking at samples of segmented and labeled speech
for  each  word.  Noah  was  able  to  automatically  learn  segment  patterns  from  more 
samples  than  was  possible  for  Wizard  and  store  more  detail  than  did  POMOW. 

6.4.2 Lexical Retrieval Component in the HWIM System

If Noah is restricted to hypothesizing only the best-rated 15 words across the whole
utterance, its performance can be compared to that of the Lexical Retrieval Component
of the HWIM System [Klovstad - 1976]. The Lexical Retrieval Component "scans" each
utterance  (using  a 1097-word  vocabulary)  to  find  the  best  matching  words  rated  better 
than  a set  threshold;  up  to  15  words  are  found. 

The comparison is somewhat artificial because the systems are using different
lower level acoustic processors. Noah is using a segmenter-labeler; the Lexical
Retrieval Component is using an "Acoustic Phonetic Recognizer" [Woods, et al. - 1976].
The APR hypothesizes phones (a higher level than the segment level) and has a
best-rated phone label accuracy about 8% higher than the best-rated segment label
accuracy for the segmenter-labeler which Noah uses (52% compared to 44%)⁶. However,
an error for a phone label is more costly than an error for a segment label, since the
phone label attempts to give more information. In addition, the word hypothesizers are
using vocabularies which differ in content as well as in size (1011 words for Noah versus
1097 words for the HWIM system) and different test utterances.

of the 46K total) also includes code not used for POMOW hypotheses; this code is
used for verifying words hypothesized by the syntactic and semantic knowledge
source of the system. (Noah was not intended to replace this task of the verifier.)

6 These values have been normalized to account for the difference between the 73 phone
labels used by the APR and the 98 segment labels used by the segmenter-labeler of the
Hearsay-II system.




The  following  two  tables  compare  the  performances  of  the  two  systems: 


                                                     HWIM Lexical
                                             Noah    Retrieval Component

Avg. No. of Words per Sentence:              6.71     6.20
Avg. No. of Word Hypotheses:                15.00    12.30
Avg. No. of Correct Words:                   2.86     2.17
Avg. No. of Incorrect Words:                12.14    10.13
Avg. Ratio of Correct to Incorrect Words:   .2356    .2142

Table 6.4: Correct versus Incorrect Word Hypotheses


Rank:          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  >15
Noah %:       69  82  86  88  92  94  95  96  96  96  96  97  98  98  98  100
HWIM LRC %:   58  74  80  84  85  86  88  88  90  91  92  92  92  92  92  100

Table 6.5: Rank Distribution of Best Correct Word Hypothesis


According to both measurements (the ratio of correct to incorrect words and
the rank distribution of the best correct word), Noah performs somewhat better than
the Lexical Retrieval Component of the HWIM system. However, we should keep in mind
the difficulties in comparing these systems. No data is available on the speed and size
of the lexical retrieval component.
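
The ratio row of Table 6.4 follows directly from the average counts in the rows above
it. As a small check, the sketch below recomputes it in Python; the dictionary layout
and the function name are ours, chosen only for illustration.

```python
# Average counts per utterance, copied from Table 6.4.
systems = {
    "Noah": {"hypotheses": 15.00, "correct": 2.86},
    "HWIM LRC": {"hypotheses": 12.30, "correct": 2.17},
}


def correct_to_incorrect_ratio(hypotheses, correct):
    """Ratio of correct to incorrect hypotheses, with incorrect = total - correct."""
    incorrect = hypotheses - correct
    return correct / incorrect


ratios = {name: correct_to_incorrect_ratio(**row) for name, row in systems.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Rounded to four places, both values agree with the ratios printed in the table.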




Chapter  7 Summary  and  Conclusions 

This  final  chapter  summarizes  the  thesis  and  draws  some  conclusions  before 
giving  the  contributions  of  the  thesis.  The  final  section  suggests  improvements  for  Noah 
and  points  to  future  work. 


7.1  Summary 

This thesis describes research directed toward the development of a general
English speech understanding system. In particular, the thesis presents the design and
performance of a bottom-up word hypothesizer (Noah) capable of handling very large
vocabularies. The design of Noah is based on a hierarchy-tree structure. Speech is
represented at four levels of a hierarchy. A tree maps the representation of speech at
one level to the representation of speech at the next higher level. Since this
design of Noah has been summarized in Chapter 1 (Section 1.4: Overview of Noah), we
restrict the summary here to Noah's performance and runtime characteristics.

7.1.1  Performance 

The word hypothesizer was trained on 174 utterances and tested on 105 new
utterances (705 words) for 7 different vocabulary sizes ranging from 500 words to
19,000 words. The performance¹ for these vocabularies ranged from a word accuracy
of 73% at an average rank of 2.6 for the 500-word vocabulary to a word accuracy of
58% at an average rank of 5.8 for the 19,000-word vocabulary. The rank of a word
hypothesis was limited to be less than or equal to 20 (i.e., a maximum of 20 hypotheses
per word were allowed). According to the average efficiency measure, which combines
word accuracy and rank measures, this performance degrades at approximately a
logarithmic rate over the range of vocabulary sizes tested.

An  analysis  of  the  effect  of  training  sample  size  on  performance  shows  that 
Noah’s  performance  will  not  improve  much  more  with  training  if  the  current  method  of 
using  arbitrary  training  utterances  is  used.  We  therefore  suggested  (Section  6.3.2)  that 
selective  training  is  needed,  i.e.,  training  for  those  sylparts  for  which  recognition  errors 
are  made. 

1 See  Section  6.1.1  for  an  explanation  of  performance  measures. 




7.1.2 Runtime Characteristics

The Noah word hypothesizer requires about 87K 36-bit words for storage of
1) recognition, debugging, analysis, and statistics code, 2) runtime constants and
variables, 3) hypotheses for a 3-second utterance, 4) sylpart training-dependent
knowledge (174 training utterances), and 5) vocabulary-dependent knowledge (1000
words). The vocabulary-dependent knowledge for 1000 words uses 11K of memory.
This increases to 148K (13 times greater) for 19,000 words.

The computation costs begin at 2.4 MIPSS (millions of instructions per second of
speech) for the 500-word vocabulary and increase at a logarithmic rate to 6.6 MIPSS
for the 19,000-word vocabulary. (This is an increase of about .75 MIPSS per doubling
of the vocabulary size.) In terms of processing time for a PDP-KL10 processor (the 1.3
MIPS machine used for the tests), the time to hypothesize the words for an utterance
ranges from 3.1 to 8.6 times the time it takes to speak the utterance, over the same
range of vocabulary sizes. If the high degree of parallelism permitted by the
recognition algorithm were exploited, these times could be greatly reduced.
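
The logarithmic growth in computation cost can be made concrete with a small
calculation. The sketch below assumes a purely logarithmic model, cost = a + b * log2(V),
fitted through the two endpoint measurements quoted above; the model form and all
names in the code are our assumptions for illustration, not the analysis used in the thesis.

```python
import math

# Endpoints reported in the text: vocabulary size and cost in MIPSS.
V_SMALL, COST_SMALL = 500, 2.4
V_LARGE, COST_LARGE = 19_000, 6.6

# Number of times the vocabulary doubles between the two endpoints.
doublings = math.log2(V_LARGE / V_SMALL)

# Cost added per doubling under the assumed logarithmic model.
slope = (COST_LARGE - COST_SMALL) / doublings


def estimated_mipss(vocabulary_size):
    """Interpolate the cost in MIPSS, assuming cost = a + b * log2(V)."""
    return COST_SMALL + slope * math.log2(vocabulary_size / V_SMALL)


print(f"doublings: {doublings:.2f}, MIPSS per doubling: {slope:.2f}")
print(f"estimated cost for 1000 words: {estimated_mipss(1000):.2f} MIPSS")
```

Fitting through the endpoints alone gives roughly 0.8 MIPSS per doubling, which is in
line with the figure of about .75 quoted above.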

7.2 Conclusions

The major conclusion from the results of the thesis is that bottom-up word
hypothesization is not greatly affected by the size of the vocabulary. We were
pleasantly surprised that the effect of vocabulary size on performance and on
computation costs is approximately proportional to the logarithm of the
vocabulary size. This result is very pleasing in that it suggests that, with improvements
in the word hypothesizer and the segmenter-labeler, speech understanding systems for
general English can obtain a great amount of constraint from the acoustics alone. Since
the main thrust of this thesis was not in building an optimal word hypothesizer but in
building one which could handle large vocabularies, many possible performance
improvements were set aside because of the time and effort needed to implement and
test them. These are suggested in the final section. It is also expected that better
segmenter-labelers will come about; when they do, it will be easy to adapt Noah to them
and thus improve its performance.

One thing that cannot be concluded from the thesis is that building speech
understanding systems for general English will now be easy. It is not known how a
speech system reacts to a bottom-up word hypothesizer with the rate of performance
degradation given here. Also, it is not known how such a system reacts to an increase
in the complexity of the grammar beyond the narrow ranges tested so far. Of course,
any improvement in the word hypothesizer will make the job easier.




The  thesis  has  shown  that  for  word  hypothesization  it  is  possible  to  handle 
many  of  the  coarticulation  problems  at  a low  level  in  the  recognition  algorithm.  This 
permits  storing  only  a base  pronunciation  for  words,  saving  storage  and  saving  effort  in 
acquiring  acoustic  descriptions  of  words.  However,  it  is  quite  possible  that  for  word 
verification,  more-detailed  descriptions  will  have  to  be  stored,  either  at  the  word  level 
or  at  the  syllable  level. 

7.3  Contributions 

The  thesis  has  contributed  in  the  following  ways: 

> The thesis is a step toward speech understanding systems for general English.
It has shown that increasing the vocabulary size for a particular bottom-up word
hypothesizer decreases its performance and increases its computation costs
approximately according to the logarithm of the vocabulary size. This in turn has
given a feel for the complexity of the "word sound similarity space" for English.

> The thesis presents the design of a bottom-up word hypothesizer which
performs better than the POMOW word-hypothesizer / Wizard word-verifier of the
Hearsay-II system and the lexical retrieval component of the HWIM system (the only
other known bottom-up word hypothesizers). The Noah word hypothesizer has a word
accuracy of 73% with an average rank of 2.9 for the correct hypotheses when using a
1000-word vocabulary. This compares to a word accuracy of 65% at an average rank of
4.5 for POMOW/Wizard. Also, Noah runs almost an order of magnitude faster than
POMOW/Wizard. Noah's best rated word hypothesis for an utterance is correct 69% of
the time. This compares favorably to a value of 58% for the lexical retrieval component
of the HWIM system.

> The  thesis  demonstrates  a solution  to  the  problem  of  knowledge  acquisition  for 
AI  knowledge-based  systems.  The  solution  is  to  separate  the  knowledge  into  a)  a priori 
knowledge:  general  knowledge  which  is  easily  acquired  and  b)  learned  knowledge  which 
completes  the  general  knowledge,  is  acquired  by  training  the  system,  and  is  specific  to 
the  particular  conditions  under  which  the  system  operates.  For  word  hypothesization 
this  solution  has  taken  the  form  of  a)  acquiring  base  word  pronunciations  from  a 
pronunciation dictionary as the a priori knowledge and b) automatically learning
segment-label  patterns  of  a particular  segmenter-labeler  for  subparts  of  the  word 
pronunciations  (i.e.,  sylparts).  Thus,  the  word  hypothesizer  is  able  to  acquire  the 
knowledge  for  19,000  words,  but  at  the  same  time  remain  free  from  ties  to  a particular 
segmenter-labeler. 

> A hierarchy-tree structure of knowledge representation was presented that
gives a way of combining the advantages of both a hierarchy structure and a tree
structure  for  reducing  the  costs  of  a bottom-up  recognition  algorithm.  The  multiple 
levels  of  the  structure  prevent  a potential  combinatoric  explosion  of  alternative 
hypotheses  by  permitting  control  of  the  hypotheses  at  each  level.  In  addition  the 
structure  provides  a framework  for  a)  storing  a priori  knowledge  and  learned 
knowledge  separately  and  b)  using  both  types  of  knowledge  jointly. 

> The  thesis  demonstrates  a method  of  learning  the  context  of  a pattern  and 
using  that  context  during  recognition  to  constrain  the  possible  interpretations  of  the 
pattern.  This  context  learning  handles  some  of  the  coarticulation  problems  present  at 
the  sylpart  level  of  speech  by  making  the  interpretation  of  a segment-label  pattern  (i.e., 
the  sylpart  it  represents)  dependent  in  part  on  the  similarity  of  a)  the  context  (i.e.,  left 
and  right  adjacent  segment  labels)  of  the  segment-label  pattern  in  the  speech  being 
recognized,  and  b)  context  previously  learned  for  the  segment-label  pattern. 

> Two  measures  (average  efficiency,  and  the  confusion  of  hypotheses)  were 
given  which  are  useful  for  understanding  systems  which  use  the  hypothesis-and-test 
paradigm.  The  average  efficiency  measure  gives  a way  of  reducing  to  one  number  the 
inter-related  measures  of  a)  the  number  of  correct  hypotheses  and  b)  their  ranking 
amongst  incorrect  hypotheses.  This  makes  monitoring  of  the  system’s  performance  for 
design  changes  and  different  test  conditions  easier.  The  confusion  measure  for 
hypotheses  shows  how  the  competition  of  hypotheses  changes  in  different  parts  of  the 
system  as  information  is  used  to  create  new  hypotheses  from  old  hypotheses. 
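
The context-learning contribution above can be illustrated with a small sketch. It
assumes that each learned context is simply the pair of labels to the left and right of
a segment-label pattern, and that a context similarity between 0 and 1 is mixed into a
base rating with a fixed weight. The class, the scoring rule, the weight, and the example
labels are all hypothetical; the thesis's actual scoring is described in its earlier
chapters and is not reproduced here.

```python
from collections import defaultdict


class ContextMemory:
    """Remembers the (left, right) segment labels seen around each pattern."""

    def __init__(self):
        # (pattern, sylpart) -> list of (left_label, right_label) seen in training
        self._contexts = defaultdict(list)

    def learn(self, pattern, sylpart, left, right):
        self._contexts[(tuple(pattern), sylpart)].append((left, right))

    def similarity(self, pattern, sylpart, left, right):
        """Best match with any learned context: 1.0 both sides, 0.5 one side, else 0.0."""
        learned = self._contexts.get((tuple(pattern), sylpart), [])
        best = 0.0
        for seen_left, seen_right in learned:
            score = 0.5 * (seen_left == left) + 0.5 * (seen_right == right)
            best = max(best, score)
        return best


def rate_interpretations(memory, pattern, candidates, left, right, context_weight=0.3):
    """Re-rank candidate sylparts for a pattern by mixing in context similarity."""
    rated = []
    for sylpart, base_rating in candidates.items():
        context = memory.similarity(pattern, sylpart, left, right)
        rating = (1 - context_weight) * base_rating + context_weight * context
        rated.append((sylpart, rating))
    return sorted(rated, key=lambda item: item[1], reverse=True)


# Hypothetical training sample and recognition-time query.
memory = ContextMemory()
memory.learn(["IH", "N"], "in", left="S", right="-")
candidates = {"in": 0.60, "en": 0.65}  # base ratings before context is considered
ranked = rate_interpretations(memory, ["IH", "N"], candidates, left="S", right="-")
print(ranked)
```

In this example the sylpart whose learned context matches the observed context
overtakes a candidate with a slightly better base rating, which is the kind of
constraint the context mechanism is meant to supply.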


7.4  Other  Applications 

In  this  section  we  speculate  on  other  possible  applications  of  the  research.  In 
particular,  we  will  look  briefly  at  using  the  knowledge  and  knowledge  representation  of 
Noah  for  investigating  what  has  been  called  the  "word  sound  similarity  space".  The 
hierarchy-tree  representation  of  knowledge  is  suggested  for  use  in  image 
understanding. 

7.4.1  Analysis  of  the  Word  Sound  Similarity  Space 

One  can  imagine  an  abstract  multidimensional  space  in  which  the  sound  of  a word 
is  represented  by  a point  in  the  space  (or  perhaps  by  a set  of  points  to  account  for  the 
many  ways  the  word  might  occur  in  speech)  and  the  similarity  between  the  sound  of 
two  words  is  represented  by  the  distance  between  their  corresponding  points  as 
measured  by  some  metric.  We  call  this  the  word  sound  similarity  space.  The  knowledge 
and knowledge representation of Noah are one possible model of this space. For the
thesis, this model was used to identify unknown points in the space (defined by segment
patterns) by finding the "closest" prelabeled points (i.e., the words of the vocabulary).

It is possible to use the same model to investigate the distribution of words in the
space. For example, the following kinds of questions could be answered for different
vocabularies: What are the regions of greatest density, that is, what words have the
most competition? If a set of words is added to the vocabulary, what is the expected
change in the performance of the word hypothesizer? How does the word density of
one vocabulary compare with another vocabulary?²

To answer these questions, the hierarchy-tree representation can be used to
compute the "distance" between words. This distance can be defined by recursively
finding the distance between the parts of the words which are stored in the levels of
the representation: The distance between two words is based on the distance between
their corresponding syllables; the distance between syllables is based on the distance
between the corresponding sylparts; the distance between sylparts is based on an
average distance between their several segment patterns; and the distance between
segment patterns is based on an experimentally-derived confusion matrix for segment
labels. The real advantage of Noah's particular model of the word space becomes
apparent when one wants to find, in parallel, the closest words to a particular word (or
perhaps a sequence of words). This is done in two steps. First, the word is expanded
recursively into its syllables and sylparts and then into the segment patterns for each
sylpart. Second, these segment patterns are used as input to Noah. By forcing Noah to
use a predetermined segment label-to-label distance metric rather than individual rated
segment labels, it can "hypothesize" words based on the segment patterns of the
particular word to find its closest words. By using different "input" words the above
questions can be answered quickly.
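
The recursive definition of word distance can be sketched as follows. This is a minimal
illustration under our own assumptions: units at each level are paired by position, a
unit with no counterpart costs a fixed penalty, and the label confusion distances are
invented values. None of these choices, names, or numbers come from the thesis.

```python
from itertools import zip_longest

MISSING_PENALTY = 1.0  # assumed cost for a unit that has no counterpart

# Hypothetical label confusion distances (0 = identical, 1 = never confused).
LABEL_DISTANCE = {("IH", "IY"): 0.2, ("T", "D"): 0.3, ("S", "Z"): 0.3}


def label_distance(a, b):
    if a == b:
        return 0.0
    return LABEL_DISTANCE.get((a, b), LABEL_DISTANCE.get((b, a), 1.0))


def aligned_distance(xs, ys, unit_distance):
    """Mean distance of units paired by position; unpaired units cost MISSING_PENALTY."""
    pairs = list(zip_longest(xs, ys))
    if not pairs:
        return 0.0
    total = 0.0
    for x, y in pairs:
        total += MISSING_PENALTY if x is None or y is None else unit_distance(x, y)
    return total / len(pairs)


def pattern_distance(p, q):
    """Distance between two segment patterns (sequences of segment labels)."""
    return aligned_distance(p, q, label_distance)


def sylpart_distance(s, t, patterns):
    """Average distance over all pairs of segment patterns stored for two sylparts."""
    ps, qs = patterns[s], patterns[t]
    return sum(pattern_distance(p, q) for p in ps for q in qs) / (len(ps) * len(qs))


def syllable_distance(a, b, patterns):
    """Distance between two syllables (sequences of sylpart names)."""
    return aligned_distance(a, b, lambda s, t: sylpart_distance(s, t, patterns))


def word_distance(w, v, patterns):
    """Distance between two words (sequences of syllables)."""
    return aligned_distance(w, v, lambda a, b: syllable_distance(a, b, patterns))


# Hypothetical sylparts, each with a single learned segment pattern.
patterns = {"s": [("S",)], "ih": [("IH",)], "iy": [("IY",)], "t": [("T",)], "d": [("D",)]}
sit = [("s", "ih", "t")]   # one syllable made of three sylparts
seed = [("s", "iy", "d")]
print(word_distance(sit, seed, patterns))
```

One consequence of averaging over all pattern pairs is that a sylpart with several
stored patterns has a nonzero distance to itself; a real metric would need to decide
whether that is desirable.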

7.4.2 Image Understanding

Before speculating on the applications of the hierarchy-tree representation to
the domain of image understanding, we will outline some of the concepts present in the
representation (these are not necessarily unique to this representation). These
concepts are: a hierarchy of levels gives levels of abstraction to the knowledge; each
level is defined by a lexicon of units; a tree structure between adjacent levels ties the
levels together by storing the patterns of units at the lower level to define a higher
level unit; a tree permits storing compactly the knowledge which is peculiar to a pair of
levels; and finally, contextual information is stored to constrain better the interpretation
of the units at one level for the next higher level.
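
The idea of a tree tying two adjacent levels together can be sketched with a prefix tree
that maps sequences of lower-level units to higher-level units. The sketch assumes exact
matching of units and uses made-up syllable and word names; Noah itself rates inexact
matches and prunes poor ones, which is not modeled here.

```python
class PatternTree:
    """Prefix tree from sequences of lower-level units to higher-level units."""

    def __init__(self):
        self.children = {}  # lower-level unit -> PatternTree
        self.units = []     # higher-level units whose pattern ends at this node

    def add(self, pattern, higher_unit):
        node = self
        for unit in pattern:
            node = node.children.setdefault(unit, PatternTree())
        node.units.append(higher_unit)

    def match_from(self, sequence, start):
        """Yield (end, higher_unit) for every stored pattern beginning at start."""
        node = self
        for end in range(start, len(sequence)):
            node = node.children.get(sequence[end])
            if node is None:
                return
            for higher_unit in node.units:
                yield end + 1, higher_unit


def hypothesize(tree, sequence):
    """All (start, end, higher_unit) hypotheses over a lower-level sequence."""
    return [
        (start, end, unit)
        for start in range(len(sequence))
        for end, unit in tree.match_from(sequence, start)
    ]


# Hypothetical syllable-to-word tree; patterns sharing a prefix share nodes.
tree = PatternTree()
tree.add(["un", "der"], "under")
tree.add(["un", "der", "stand"], "understand")
tree.add(["speech"], "speech")

hypotheses = hypothesize(tree, ["un", "der", "stand", "speech"])
print(hypotheses)
```

Because "under" and "understand" share their first two syllables, a single walk down the
tree produces both hypotheses, and a failed match at any node rejects every pattern below
it at once.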

2 This is a question that [Goodman - 1976] investigated for smaller vocabularies.

Image understanding also fits this model. Vision is commonly represented by a
hierarchy of levels of abstraction ([McKeown & Reddy - 1977] give 6 possible levels:
pixel,  patch,  region,  object,  cluster,  and  scene).  A tree  could  be  used  to  store  compactly 
the  patterns  of  units  at  one  level  to  define  the  units  at  the  next  higher  level.  The  major 
difference  between  these  trees  and  those  found  in  Noah  is  that  they  would  not  store 
time  adjacent  units  but  positionally  adjacent  units.  Techniques  would  have  to  be 
developed  to  linearize  the  vision  patterns  for  storage.  The  lower-level  trees  could  be 
used  more  for  detailed  positional  knowledge;  the  higher-level  trees  could  contain  a more 
semantic  type  of  knowledge.  In  all  of  the  trees,  contextual  information  could  be  used  to 
constrain  the  interpretations  of  the  patterns  (positional  context  at  the  lower  levels, 
semantic  context  at  the  higher  levels).  We  believe  that  other  problem  domains  using  the 
above  concepts  could  also  use  the  hierarchy-tree  representation. 


7.5  Future  Research 

The discussion of future research is divided into a section on suggestions for
improving Noah and a section on research for integrating Noah into a total speech
system.

7.5.1  Suggested  Improvements  for  Noah 

The  suggested  improvements  are  given  in  order  of  increasing  probable  gain  in 
performance.  This  generally  corresponds  to  an  increasing  amount  of  work. 

7.5.1.1 Tuning the System

Though  tuning  can  be  a never-ending  task,  it  is  expected  that  better 
performance  can  be  obtained  (particularly  for  multisyllabic  words)  by  tuning  the  current 
thresholds  and  parameters  which  control  the  rating  of  hypotheses  at  each  level.  These 
thresholds  and  parameters  control  context  scoring,  weight  penalties,  normalization  for 
the  length  of  a pattern,  and  acceptance  of  hypotheses  at  each  level. 

7.5.1.2 Selective Training

Learning  the  segment  patterns  and  context  for  those  sylparts  causing  errors  is 
suggested  for  increasing  the  effectiveness  of  training.  One  way  to  do  this  is  to 
incorporate  in  the  word  hypothesizer  a method  of  automatically  training  on  the  segment 
patterns  for  those  sylparts  which  cause  a word  to  be  missed.  Much  of  the  work  has 
already  been  done  for  this  in  the  form  of  an  analysis  program  which  shows  why  a word 
was  missed  or  why  it  was  rated  poorly.  The  names  and  positions  of  the  correct  words 
could  be  obtained  either  from  interaction  with  the  user  or  from  feedback  from  the  total 
speech  system  when  it  manages  to  recognize  the  utterance  in  spite  of  some  missing 
correct  bottom-up  word  hypotheses. 




More information can be used by Noah to improve its performance. We suggest
four types of information here: stress, context, duration, and syllable boundaries. Each
of these would require considerable research to design, test, and adjust. Another
problem is that an increase in the amount of information used by Noah causes an even
greater increase in the amount of training needed.

Syllable stress information could be used in several ways: 1) as a further test
for multisyllabic words (the stress pattern of a word would have to match the stress
pattern of the utterance), 2) as a way of reducing erroneous function word hypotheses,
and 3) as part of the vowel recognition process (a schwa should probably not be
hypothesized in a stressed location).

The context used by Noah for recognizing sylparts is small. This could be
increased to more segments as required by particular sylparts or to other types of
information, such as the stress of surrounding syllables. The difficulty with learning and
using context is that the system or designer must determine when the context is really
affecting the patterns for the sylparts.

The duration of the segments in each segment pattern was once used in the
Noah recognition algorithm without success. Duration information may still be useful if it
is used for the total segment pattern. Duration information could also be used in
conjunction with amplitude information to determine syllable stress.

It is firmly believed that detection of syllable boundaries would greatly reduce
the number of incorrect syllable hypotheses and, in turn, incorrect word hypotheses. An
example of the confusion resulting from lack of syllable boundaries was seen in Figure
6.6. Mermelstein [Mermelstein - 1975] has demonstrated the ability to recognize
syllable boundaries with an error rate of 6.9% syllable boundaries missed and 2.6%
extra syllable boundaries found. It seems that performance like this would greatly
increase Noah's performance for large vocabularies.

7.5.2 Speech System Integration for Noah

7.5.2.1  Performance  within  a Particular  System 

More work needs to be done in evaluating Noah within the constraints of
particular speech systems. For example, suppose a system is not very good at dealing
with missing correct word hypotheses, but is able to handle many hypotheses efficiently
and effectively. The performance evaluation of Noah relevant to such a system is the
percent of correct word hypotheses at a constant average rank as the vocabulary
increases, without much regard to the total number of hypotheses. The measures of
performance given in the thesis are not weighted in this direction. In addition, Noah
should be analyzed in terms of cost-effectiveness for the system using it.

7.5.2.2  System  Control 

The performance of Noah can be improved if it is controlled intelligently by the
speech system using it. Any a priori knowledge that can be used to constrain the word
hypothesizer effectively increases its performance. Three types of constraints are
suggested here. Each of them can be applied over the whole utterance or in selected
parts of the utterance.

As discussed in Section 6.3.7, the word hypothesizer can be controlled to
hypothesize words with specific characteristics. For example, function words can be
"tuned out". The hypothesizer can thus be adjusted to the overall needs of the particular
speech system or to its needs at particular parts of the utterance.

Word  hypotheses  could  also  be  biased  a priori  by  the  expected  topic  of  the 
utterance.  For  example,  as  in  these  tests,  if  the  expected  topic  was  Artificial 
Intelligence  articles,  the  words  of  the  1000-word  Hearsay-II  dictionary  could  be  biased 
over  the  other  words  in  the  19,000-word  dictionary.  This  biasing  can  be  done  by 
penalizing those words not included by the topic. Words should only be penalized,
not eliminated, to permit the speaker to change topics.
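
A minimal sketch of such topic biasing follows. It assumes that each word hypothesis
carries a rating where higher is better and that the penalty is a multiplicative factor;
the function name, the penalty value, and the example words are hypothetical.

```python
def bias_by_topic(hypotheses, topic_words, penalty=0.8):
    """Scale down the rating of words outside the expected topic; never drop them."""
    biased = [
        (word, rating if word in topic_words else rating * penalty)
        for word, rating in hypotheses
    ]
    return sorted(biased, key=lambda item: item[1], reverse=True)


# Hypothetical hypotheses as (word, rating) pairs and a hypothetical topic vocabulary.
hypotheses = [("heuristic", 0.70), ("hysteric", 0.80), ("search", 0.60)]
topic_words = {"heuristic", "search"}

ranked = bias_by_topic(hypotheses, topic_words)
print(ranked)
```

The out-of-topic word drops below the in-topic competitor but stays in the list, so a
change of topic by the speaker can still be recognized if the acoustic evidence is strong
enough.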

Finally, words at a particular location in the utterance could be selected by the
speech  system  for  hypothesization  according  to  a particular  part  of  speech  and  topic 
(e.g.,  color  adjective,  person’s  name,  etc.).  Generally,  this  is  done  by  word  verifiers  for 
speech  systems,  but  if  the  set  of  words  (such  as  "all  nouns")  is  very  large  the  word 
hypothesizer  should  be  called.  Since  the  vocabulary  size  is  effectively  reduced  by 
these  constraints,  the  thresholds  of  the  word  hypothesizer  could  be  relaxed  to  avoid 
missing  the  correct  word  hypothesis. 

7.5.3  Great  Expectations 

It is expected that the performance of bottom-up word hypothesization can be
improved to a word accuracy of 80% at an average rank of 3 with an average of 10
competing hypotheses for each utterance word, using a 20,000-word vocabulary. This
improvement would be made by including the above suggestions and by improvements
in segmenting and labeling. Though in some sense these numbers are "out of the blue",
we have seen that the performance of the Noah word hypothesizer equals the
performance of the POMOW word hypothesizer and the Wizard word verifier of the
Hearsay-II speech system using a vocabulary almost 8 times smaller. We estimate that
the above numbers represent the same magnitude of performance increase over Noah
and could be achieved with about the same amount of effort (about two man-years). If
the performance of such a hypothesizer were affected by large vocabularies in the
same way as Noah (and there is no reason to think otherwise), it would have a word
hypothesization accuracy of 73% at an average rank of 4.5 for a 100,000-word
vocabulary.
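
The extrapolation to a 100,000-word vocabulary can be approximated with a short
calculation. The sketch below is our own reading, not a calculation given in the thesis:
it assumes that accuracy and average rank each change by a fixed amount per doubling of
the vocabulary, estimates those amounts from the two endpoint results in Section 7.1.1,
and applies them to the hoped-for 80% at rank 3 for 20,000 words.

```python
import math

# Noah's measured endpoints (Section 7.1.1): vocabulary size -> (accuracy %, average rank).
NOAH = {500: (73.0, 2.6), 19_000: (58.0, 5.8)}

doublings = math.log2(19_000 / 500)
accuracy_loss_per_doubling = (NOAH[500][0] - NOAH[19_000][0]) / doublings
rank_gain_per_doubling = (NOAH[19_000][1] - NOAH[500][1]) / doublings


def extrapolate(accuracy, rank, from_size, to_size):
    """Apply Noah's per-doubling degradation to another operating point."""
    steps = math.log2(to_size / from_size)
    return (
        accuracy - accuracy_loss_per_doubling * steps,
        rank + rank_gain_per_doubling * steps,
    )


accuracy, rank = extrapolate(80.0, 3.0, 20_000, 100_000)
print(f"{accuracy:.1f}% at average rank {rank:.1f}")
```

Under these assumptions the result is a little over 73% at an average rank of about 4.4,
close to the figures stated above.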




References 


Bahl, L. R., et al. Preliminary results on the performance of a system for the automatic
recognition of continuous speech. 1976 IEEE Inter. Conf. on Acoustics, Speech
and Signal Processing, Philadelphia, Apr., 1976, 425-429.

Baker,  J.  K.  Stochastic  modeling  as  a means  of  automatic  speech  recognition.  Tech. 
Report,  CMUCSD,  1975.  Ph.D.  Dissertation  --  The  Dragon  system. 

CMU  Computer  Science  Speech  Group.  (1976)  Working  papers  in  speech  recognition  - 
IV  - the  Hearsay-II  system.  Tech.  Report,  CMUCSD,  1976. 

CMU  Computer  Science  Speech  Group.  (1977)  Summary  of  the  CMU  five-year  ARPA 
effort  in  speech  understanding  research.  Tech.  Report,  CMUCSD,  1977. 

Cole,  A.  The  ARPA-SUR  phonological  rules:  summary  and  index.  ARPA  Speech 
Understanding  Research  Note  No.  136.,  May  1974.,  (Cole  is  at)  the 
Computational  Speech  and  Language  Group,  Electronics  Research  Laboratory, 
University  of  California  at  Berkeley. 

Erman, L. D. A functional description of the Hearsay-II system. Proc. 1977 IEEE Inter.
Conf. on ASSP, Hartford, CT, May, 1977, 799-802.

Feigenbaum, E. The Art of Artificial Intelligence: I. Themes and case studies of
knowledge engineering. Proc. IJCAI-77, Mass., Aug., 1977, 1014-1029.

Forgie, J. W. Overview of the Lincoln System. IEEE Symposium on Speech Recognition,
Carnegie-Mellon University, 1974, 27.

Goldberg, H. Performance of the Hearsay-II segmenter-labeler. Private communication,
1977.

Goodman, G. Analysis of languages for man-machine voice communication. Tech.
Report, CMUCSD, May, 1976. (Ph.D. Dissertation, Comp. Sci. Dept., Stanford
University)

Hayes-Roth, F. and Mostow, D. J. An automatically compilable recognition network for
structured patterns. Proc. IJCAI-75, Tbilisi, USSR, Aug., 1975. Also appeared
in [CMU Computer Science Speech Group - 1976].

Hayes-Roth, F., Mostow, D. J., and Fox, M. Understanding speech in the Hearsay-II
system. In Natural Language Communication with Computers. (Bolc, L., Ed.)
Springer-Verlag, Berlin, 1977, (in press).




Itakura, F. Minimum prediction residual principle applied to speech recognition. IEEE
Trans. ASSP-23, 1975, 67-72.

Klovstad, J. W. Probabilistic lexical retrieval component with embedded phonological
word boundary rules. Technical Report in Woods, et al., Speech Understanding
Systems Technical Progress Report No. 6, Bolt, Beranek and Newman Inc., Apr.,
1976, 68-108. Also a Ph.D. Dissertation, to be published.

Knuth, D. (1968) The Art of Computer Programming: Vol. 1, Fundamental Algorithms,
Addison-Wesley, Menlo Park, Ca., 1968.

Knuth, D. (1973) The Art of Computer Programming: Vol. 3, Sorting and Searching,
Addison-Wesley, Menlo Park, Ca., 1973.

Lowerre, B. T. The Harpy speech recognition system. Tech. Report, CMUCSD, 1976.
Ph.D. Dissertation.

McKeown, D. M. and Reddy, D. R. (1977) A hierarchical symbolic representation for an
image database. Proceedings of IEEE Workshop on Picture Data Description and
Management, Chicago, Ill., Apr., 1977.

McKeown, D. M. Word verification in the Hearsay-II speech understanding system.
Proc. 1977 IEEE Inter. Conf. on Acoustics, Speech and Signal Processing,
Hartford, CT, May, 1977, 795-798.

Mermelstein, P. Automatic segmentation of speech into syllabic units. Journal of the
Acoustical Society of America, 58, 1975, 880-883.

Miller, G. A. and S. Isard Some perceptual consequences of linguistic rules. Journal of
Verbal Learning and Verbal Behavior, Vol. 2, 1963, 217-228.

Newell, et al. Speech understanding systems: final report of a study group. North
Holland, 1973. (Originally appeared in 1971).

Newell, A. A tutorial on speech understanding systems. In Speech Recognition: Invited
Papers of the IEEE Symp. (Reddy, D. R., Ed.) Academic Press, New York, NY,
1975, 3-54.

Olney,  J.,  and  D.  Ramsey  From  machine  dictionaries  to  a lexicon  tester:  progress,  plans 
and  an  offer.  Computer  Studies  in  the  Humanities  and  Verbal  Behavior,  Vol.  3, 
No.  4,  Nov.,  1972,  213-220. 

Reddy, D. R. An approach to computer speech recognition by direct analysis of the
speech wave. Tech. Report, Stanford University, AI Memo 43, Stanford, CA,
1966. Ph.D. Dissertation.

Reddy, D. R. and Vicens, P. J. A procedure for segmentation of connected speech. J.
Audio Engr. Soc., 16, Apr., 1968, 404-412.

Reddy, D. R., Erman, L. D. and Neely, R. B. (1972) A mechanistic model of speech
perception. Proc. 1972 IEEE Conf. Speech Communication and Processing,
Newton, MA, Apr., 1972, 334-337.

Reddy, D. R., L. D. Erman, and R. B. Neely (1973) The HEARSAY speech understanding
system: an example of the recognition process. Proc. 3rd Inter. Joint Conf. on
Artificial Intel., Stanford, Ca., 1973, 185-193.


Rubin, F., Experiments in text file compression. Comm. of the ACM, Vol. 19, Nov., 1976,
617-623.

Shannon,  C.  E.,  A mathematical  theory  of  communication,  Bell  System  Technical  Journal, 
27,  1948. 

Shockey, L. and Adam, C. The phonetic component of the Hearsay-II speech
understanding system. In [CMU Computer Science Speech Group - 1976].

Sivertsen,  E.,  Segment  inventories  for  speech  synthesis,  based  on  University  of 
Michigan  Speech  Research  Laboratory  Report  No.  5.,  1961. 

Smith, A. R. Word hypothesization in the Hearsay-II speech system. Proc. 1976 IEEE
Inter. Conf. on ASSP, Philadelphia, PA, Apr., 1976, 549-552. Also in [CMU
Computer Science Speech Group - 1976].

Vicens, P. J. Aspects of speech recognition by computer. Tech. Report, Stanford
University, AI Memo 85, Stanford, CA, 1969. Ph.D. Dissertation.

Woods, W. A. Transition network grammars for natural language analysis, Comm. of the
ACM, Vol. 13, Oct., 1970, 591-606.

Woods, W., et al. Speech understanding systems: final report. Bolt, Beranek and
Newman Inc., Oct., 1976.


Appendix A: "ARPABET" Computer Phonetic Representation


Phoneme  Computer         Example      Phoneme  Computer         Example
         Representation                         Representation

i        IY               beat         n        N                net
ɪ        IH               bit          ŋ        NX               sing
e        EY               bait         p        P                pet
ɛ        EH               bet          t        T                ten
æ        AE               bat          k        K                kit
ɑ        AA               bomb         b        B                bet
ʌ        AH               but          d        D                debt
ɔ        AO               bought       g        G                get
o        OW               boat         h        HH               hat
ʊ        UH               book         f        F                fat
u        UW               boot         θ        TH               thing
ə        AX               about        s        S                sat
ɝ        ER               bird         š        SH               shut
aʊ       AW               down         v        V                vat
aɪ       AY               buy          ð        DH               that
ɔɪ       OY               boy          z        Z                zoo
y        Y                you          ž        ZH               azure
w        W                wit          syl l    EL               battle
r        R                rent         syl m    EM               bottom
l        L                let          syl n    EN               button
m        M                met
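As a small illustration of how the computer representation in Appendix A might be used, the sketch below stores the representations and example words from the table as a lookup and checks a space-separated pronunciation against it. The names `ARPABET_EXAMPLES` and `unknown_symbols` are hypothetical and not part of the thesis software; the spellings OW, UW, AW, W and M assume the standard ARPABET forms of the table entries.

```python
# Computer representations and example words, transcribed from the Appendix A table.
# The dictionary and function names are illustrative assumptions.
ARPABET_EXAMPLES = {
    "IY": "beat", "IH": "bit", "EY": "bait", "EH": "bet", "AE": "bat",
    "AA": "bomb", "AH": "but", "AO": "bought", "OW": "boat", "UH": "book",
    "UW": "boot", "AX": "about", "ER": "bird", "AW": "down", "AY": "buy",
    "OY": "boy", "Y": "you", "W": "wit", "R": "rent", "L": "let", "M": "met",
    "N": "net", "NX": "sing", "P": "pet", "T": "ten", "K": "kit", "B": "bet",
    "D": "debt", "G": "get", "HH": "hat", "F": "fat", "TH": "thing",
    "S": "sat", "SH": "shut", "V": "vat", "DH": "that", "Z": "zoo",
    "ZH": "azure", "EL": "battle", "EM": "bottom", "EN": "button",
}


def unknown_symbols(pronunciation: str) -> list:
    """Return the symbols of a space-separated pronunciation missing from the table."""
    return [sym for sym in pronunciation.split() if sym not in ARPABET_EXAMPLES]


if __name__ == "__main__":
    # A pronunciation built only from table symbols yields an empty list.
    print(unknown_symbols("K AE B AX N AX T"))
    # "QQ" is a made-up symbol used only to show the check firing.
    print(unknown_symbols("K AE B QQ"))
```

A check of this kind is only a convenience for catching transcription slips; it says nothing about whether a pronunciation is phonetically sensible.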


Appendix  B:  Lexicons 

This appendix includes the sylpart lexicons, a list of the segment class labels,
and the 2000-word lexicon.

Sylpart  Lexicons 

The number after each entry gives the number of training samples found for the
sylpart in the 174 utterance training set.


VOWEL LEXICON


lY  iia 

IH  264 

EY  68 

EH  188 

RE  187 

BS77 

RO  19 

OU  36 

UH  3 

UU  47 

IX  a 

RX  29S 

ER  86 

RU  27 

RY  71 

ov  a 

EL  38 

EN  5 

EH  14 

VOWEL SEQUENCE LEXICON

EH  L « 

OU  R 3 

Y UH  8 

RX  RO  6 

OU  L 3 

nn  R 23 

RO  R 9 

R IH  17 

EH  R 3 

HH  ER  RX  1 

lY  EY  4 

R OU  3 

lY  RR  4 

RX  H 2 

L lY  6 

RX  RE  4 

UU  EH  6 

IH  L S 

lY  RE  S 

R EY  3 

lY  R lY  3 

EH  R IH  1 

RE  L 1 

UH  R EL  1 

lY  UH  EH  1 

UU  IH  1 

lY  RX  9 

RR  L RX  2 

ER  IH  3 

RE  n IH  1 

U RX  3 

R RE  7 

EL  RX  4 

•ER  RX  5 

R RX  9 

RY  RX  1 

U UH  1 

Y UU  ER  1 

RE  R S 

EH  L RX  2 

UU  RO  L 1 

U IH  2 

L OU  1 

UU  RX  1 

RR  R RX  1 

IH  N 1 

N RY  1 

R EH  7 

R OL  5 

ER  R EH  1 

lY  R En  1 

lY  EH  6 

lY  RR  R 1 

L EY  IH  1 

IH  R 4 

UU  Y UU  HH  RE 

1 

RY  EH  1 

Y ER  1 

UU  ER  1 

R lY  3 

RR  R EH  1 

RX  OU  1 

RE  R RO  2 

RR  L 1 

ER  L RV  1 

L RE  1 

RE  R EH  3 

L IH  3 

UU  HH  RE  1 

UH  IH  1 

HN  IH  R 1 

OH  RX  1 

RX  0 RX  1 

ER  lY  2 

EY  RY  2 

IH  L lY  1 

UH  R IH  2 

RX  U EH  R 

1 

UH  L RV  1 

EH  R lY  1 

R EH  8 

RE  R RX  1 

EH  T ER  1 

Y UU  EL  1 

OH  RX  RE  1 

R EL  1 

RX  EY  3 

RX  R RX  H 

1 

EY  IH  1 

RY  R RX  1 

UH  L 1 

OU  R lY  1 

lY  0 lY  1 

lY  R 1 

RO  R lY  1 

RR  L IH  2 

R EY  IH  1 

R RR  1 

RE  R lY  1 

RE  L RX  1 

lY  IH  1 

L EH  1 

ONSET  LEXICON 

P 91 

T 88 

K 64 

B 6 

D 8S 

G 24 

F 48 

TH  18 

S 97 

SH  24 

ZH  2 

V 17 

OH  87 

2 18 

n 82 

N 44 

L 47 

R 29 

HH  43 

UH  S3 

y 9 

U 23 

n Y 2 

N Y 12 

P L 9 

P R IS 

P Y 3 

T SH  22 

T R 13 

T U 1 

T Y 1 

K L 3 

K R 1 

X y 1 

K U 4 

B L 2 

B R 2 

B Y 2 

0 2N  13 

0 R 3 

OVA 

G L 2 

C R 5 

G U S 

HH  V 2 

F R 2 

TH  R 2 

S P 4 

S T 24 

S K 13 

s n 1 

S N 1 

S L 2 

V Y 1 

SPL  1 

S P R 1 

S T R 

11 

S X R 1 

CODA LEXICON

R 19 

L 13 

n 22 

N 25 

NX  S3 

P 14 

T 133 

K 33 

B 26 

D 35 

C 8 

F 10 

TM  1 

S 42 

SH  19 

V 62 

DH  3 

Z 121 

ZH  1 

R H 1 

R N 1 

L P 1 

It  P 1 

S P 1 

R T 16 

N T 14 

K T 6 

S T 26 

SH  T 4 

NX  X 3 

S X 2 

R 0 2 

L 0 1 

N 0 13 

V 0 1 

2 D 1 

L F 1 

R TH  1 

N TH  2 

P TH  1 

RSI 

LSI 

N S 4 

P S 1 

T S 7 

K S 19 

FSI 

T SH  17 

L V 1 

R Z 1 

L Z 1 

N Z 3 

N 2 14 

N XZ  2 

B Z 1 

0 Z 2 

V Z 2 

0 ZH  9 

R N T 2 

N S T 2 

K S T 4 

RTS 

2 

N T 5 2 

K T S 12 

S T S 1 

L K S 1 

NX  X S 

1 

R T SH  2 

N T SH  2 

R n Z 1 

N 0 Z 2 

L V Z 

1 

R 0 ZH  1 

Segment Class Labels


ASF - aspiration or fricative
ASH - aspiration or high energy fricative
ASL - aspiration or low energy fricative
ASP - aspiration
BAR - voice-bar
DCN - dip position consonant
FLB - voice-bar or flap
FLP - flap
FRC - fricative
FRU - unvoiced fricative
FRV - voiced fricative
FSI - fricative or silence
FVB - final voice-bar or nasal
GLD - liquid or glide
HFR - high energy fricative
HHV - HH or Y
HUF - high energy unvoiced fricative
HVF - high energy voiced fricative
LFR - low energy fricative
LUF - low energy unvoiced fricative
LVF - low energy voiced fricative
NAS - nasal
NGL - nasal, glide, or liquid
NVF - nasal or voiced fricative
PLS - low energy speech sound
SIL - silence
SPL - sonorant voicing minimum
SPP - sonorant voicing peak
TCN - bn position consonant
FRU - unvoiced fricative
VAA - AA-like vowel
VBK - back vowel
VCN - non-nuclear vowel or resonant consonant
VFT - front vowel


VIY - IY-like vowel
VLW - low vowel
VMD - mid vowel
VSW - schwa-like vowel
VUW - UW-like vowel


The  2000-Word  Lexicon 

The  code  following  each  word  indicates  the  smallest  vocabulary  in  which  the 
word  is  found:  (T)  the  word  occurs  in  the  test  utterances,  (5)  500-word  vocabulary,  (1) 
1000-word  vocabulary,  and  (2)  2000-word  vocabulary. 
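The coding scheme can be illustrated with a short sketch. It assumes the vocabularies are nested (test-utterance words within the 500-word vocabulary, within the 1000-word vocabulary, within the 2000-word vocabulary), which is how "smallest vocabulary" is read here; the function names and the handful of sample entries are chosen for illustration only.

```python
# Codes in order of increasing vocabulary size: T (test utterances), 5 (500 words),
# 1 (1000 words), 2 (2000 words). Nesting of the vocabularies is an assumption.
CODE_ORDER = ["T", "5", "1", "2"]


def parse_entry(entry: str) -> tuple:
    """Split a lexicon entry such as 'BOOK/5' into (word, code)."""
    word, code = entry.rsplit("/", 1)
    return word.strip(), code.strip()


def vocabulary(entries: list, level: str) -> list:
    """Return the words belonging to the vocabulary named by `level`.

    A word belongs to a vocabulary if its code names that vocabulary
    or any smaller one.
    """
    limit = CODE_ORDER.index(level)
    words = []
    for entry in entries:
        word, code = parse_entry(entry)
        if CODE_ORDER.index(code) <= limit:
            words.append(word)
    return words


if __name__ == "__main__":
    # A few entries copied from the lexicon listing below.
    sample = ["ACTIVE /5", "BEST/2", "BLOCK/1", "BOOK/5", "CHESS/T"]
    for level in CODE_ORDER:
        print(level, vocabulary(sample, level))
```

Storing only the smallest vocabulary per word keeps the listing compact: each larger vocabulary is recovered by a single pass with a looser threshold.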


n T 

RBDlCRTION/2 

RBJURE/2 

RBOniNRTE/2 

RBOUT/T 

RBSENT/2 

BBSTRflCT/T 

R6STRRCTI0N/1 

RBSTRRCTS/T 

RBUT/2 

RCCEPTRNCE/2 

RCCOnPL ICE/2 

nccusnTiON/2 

RCIOITY/2 

RCL/T 

RCn/1 

RCQUISITION/1 

RCRIOITY/2 

RCTlONS/1 

ACTIVE /5 

RCTURL/2 

RCYCLlC/1 

RDRPTRTION/5 

flOnPTIVE/5 

flDDICT/2 

RDDITIOH/5 

RDDRESS/S 

ROORESSES/T 

ROJOIN/2 

R0I1IRER/2 

RDRENnL/2 

RDVERSRRy/2 

RDVISING/5 

RERRTION/2 

RESTHETICS/5 

RFRR/2 

flEFILinTIOHrS 

RFFILIRTIONS/S  flFFLUENCE/2 

RFTER/S 

RFTERBIRTH/2 

RCCRRNniZE/2 

nG0«'2 

RI/T 

flILnENT/2 

RLBINO/2 

RLEnBlC/2 

RLCEBRRIC/5 

BLCOLr'S 

RLCORITHH/T 

ALCORITHHIC/S 

RLlnEHTflRy/2 

RLL/T 

RLL-OR-NONE/5 

niLEH/T 

flLLOCRTION/2 

RLOHE/S 

RLSO/T 

RLTERRTION/2 

RLURYS/T 

m/7 

RnRH/2 

RlinERGRlS/2 

RnENns/2  ' 

RnnONITE/2 

RI10NG/1 

nnPHQRn/2 

RH/T 

RNnL/2 

RNRLOCY/5 

RNRLYSIS/T 

RNRLy2ER/l 

nNRTOnV/'E 

RHD/T 

RMESTHETIST/2 

RNinRL/2 

RNN/1 

RNNIHILRTE/2 

RNOnnLDUSi'2 

RHOTHER/5 

RNSUER/T 

RNSUERING/1 

RNTENNR/2 

RNTHONV/1 

RNTICHBIST/2 

fiHTiaUITY/2 

RMY/T 

ANYONE /5 

RNVTHING/T 

RNVUHERE/T 

RPflRTHEIO/2 

RPlECE/2 

RP0THE05IS/2 

RPPERR/l 

RPPERREO/T 

RPPEHOIX/2 

RPPLlCRTION/1 

RPPOINTnENT/2 

RPPRENTlCE/1 

RPPRORCH/1 

RPPR0XinRTI0N/2RPRIL/5 

RRR0ESQUE/'2 

RRBlB/5 

RRCnDE/2 

RRCHlTECTURE/2 

RRE/T 

RRER/T 

RRERS/T 

flREH*T/5 

RRIIISTICE/2 

RRPR/5 

RRREST/2 

RRT/T 

RRTICLE/T 

RRTICLES/T 

RRTICULRR/2 

RRTIFICIRL/T 

RRTS/l 

RSCENO/2 

Rsinov/i 

RSIHINITY/2 

flSK/T 

RSSniL/2 

RSSEHBLY/T 

RSSERTIONS/1 

RSS1I11LRTE/2 

RSSIHILRTIQN/T  RSSOCIRTION/T 

RSSOCIRTIVE/T 

RSTHtlflTIC/2 

RSTROPHYSICS/2 

RT/T 

RTONIC/2 

RTTR IMRB IL I TY/2flTTENTI0K/5 

RTTRIBUTRBLE/2  RUOtO/2 

RUGtlENTEDrT 

RUGUST/1 

RURRR/2 

RUTHOR/1 

RUTH0RITRTIVE/2RUTH0RS/T 

RUTOWRTED/'T 

RUTOnOTlC/T 

RUTonnrioN/T 

RUTOtlOTIVE/2 

RVRILflBLE/5 

RVERnENT/2 

RVnUCHr2 

RURRO/l 

RUN/2 

RXIOHRTIC/T 

RXIOnS/T 

RZRIEL/T 

BRBBLE/2 

BRCKGRimOM/T 

BRr.TERIOLOCY/2  BRIL/2 

BRLORIC/2 

BRLSRn/2 

BRHDSTflNDr2 

BRNERJI/1 

BRNK/l 

BRNTER/2 

BRRON/2 

BRRROU/l 

BRRTENnER/2 

BflSE/1 

BRSEBRLL/l 

BRSEO/T 

BRSES/1 

BRS50/2 

BRTES/1 

BflTOH/2 

BRY/T 

Bfl200Kfl/2 

BERSr/2 

6ECK0N/2 

BEEF/2 

BEEN/S 

BEFORE/S 

BECONIR/2 

BEHRVIOR/l 

BELIEF /T 

BELIKE/2 

BENEnTH/2 

BEN20L/2 

BERKELEV/T 

BERLlNER/6 

BERNRRO/l 

BERT/1 

BEST/2 

BETUEEN/1 

BE2aRR/2 

BlDDEN/2 

BIC/1 

BJLL/T 

BILLET/2 

BINOINC/1 

BINDINGS/T 

BlOCRRPHy/2 

BIOIIEDICINE/T 

BlSEXUflL/2 

BLRB/2 

BLRRHEY/2 

BLEOSOE/1 

BLEniSH/2 

BLITHE/2 

BLOCK/1 

BLOSSOIt/2 

BLUEPOIHT/2 

BOnTHOUSE/2 

BOBROU/5 

Booy/2 

B0NDI1RH/2 

BONHIE/1 

BOOK/5 

B00i:S/5 

BOOIirOUN/2 

BORNE/2 

BOUFFRNT/2 

BOUHnS/5 

BOUTOHHIERE/2 

BRRr.E/2 

BRRIN/T 

BRRNn/2 

BRR2IL1RN/2 

SREECH/2 

BRlG/2 

6R1SKET/2 

BROIL/2 

BROTHER/2 

BRUCE/5 

BRUTRLITY/2 

BUCIIflHRH/5 

BUOGERICRR/2 

BULCE/2 

BUNCH/2 

BURERU/2 

BURnnSE/2 

BUSH/2 

BUSINESS/T 

BUT/1 

BUTTOCKS/2 

BY/T 

CRBIN/2 

CRCI1/5 

CR0EN2R/2 

CRI/T 

CRJUN/2 

CRLCULUS/S 

CRLF/2 

CRLLOU/2 

CRIIPUS/2 

CRN/T 

CRNOtORTE/2 

CRHHONBnLL/2 

CflHT!CLE/2 

CflPRDlLITIES/l 

CflPRClTY/2 

CRPRICE/2 

CRPUCHIN/2 

CRR/1 

CRRB0NIFER0US/2CRRFRRE/2 

CRRL/1 

CRRNIVORR/2 

CRRRy/2 

CflRTOCRflPHV/'T 

CRSE/T 

CRSHIER/2 

CRSTICRTION/2 

CflTflLPR/2 

CRTERCORNER/2 

cnucnsinN/'2 

CRUSRL/T 

CRVRLCRnE/2 

CERSE/S 

CELflNOINE/2 

CELL/5 

CEnBRI.0<'2 

CENTENNinL/2 

CEHTURV/2 

CERTlFY/2 

CHRLCEDOHV/2 

CHflHrELLOR/2 

CHnPCL/'2 

CHRRGC/2 

ClIRRNIRK/l 

CHEflPCH/2 

CHECKER/1 

CHECKING/r 

CHELn/2 

CHESS/T 

CHEVRLIER/2 

CHlEF/2 

CHlN/2 

CHITLINS/2 

CHOLERIC/2 

CNOOSE/5 

CHOSEN/2 

CHRISTOPNER/1 

CHRONOCRRPH/2 

CHUCK/1 

CHURL/2 

CINNRBnR/2 

CIRCLE/T 

CIRCUIT/T 

CIRCUITS/1 

CIRCUrtLOCUTION/ 

CITE/T 

CITEO/5 

ClTES/1 

ClTY/2 

CLRN/2 

CLflSS/2 

aEnNSE>'2 

CLICHE/2 

CLIHBINC/S 

CLIPBOnRD/2 

CLOTHIER/2 

CLUSTERINC/T 

CLUTTER/2 

cnu/1 

COBBLE/2 

COCKNEY/2 

COOE/1 

CODING/5 

COEFFICIENT/2 

COGHITION/T 

COGHITIVE/T 

COCHOnEN/2 

COLBY/T 

COLES/1 

COLESLnH/2 

COLLEGinN/2 

COLLINS/T 

conB/2 

COHE/T 

COnET/2 

COnnEHDflTION/2  COnnENTS/l 

COHMlTTEE/l 

COintOH/T 

connoTioH/2 

COnnUNICRTION/l 

C0nnUHICflTI0NS/C0))PnRnBLE/2 

COMPILRTION/2 

COIIPLEX/T 

COnPLEXITY/l 

COnPOHENT/2 

COnPOHENTS/1 

COnPREHENSION/ICOItPUHCTION/2 

COIIPUTRTIOH/T 

COHPUTHTIONRL/TCOnPUTER/T 

C0I1PUTERB/1 

COIIPUTING/5 

COHCEPTURL/T 

CONCERN/S 

CONCERNED/S 

CQHCERNING/G 

CONCERTINfl/2 

CONCRETE/2 

CONCURRENT/1 

CONOOLE/2 

COHFEOERRCV/2 

CONFERENCE/5 

CONFERENCES/5 

CONFINE/S 

CONFORtlflTION/2  COHCRESSnflH/2 

COHHIVE/2 

COHSEHT/2 

CONSIDER/5 

C0N5IDERE0/5 

CONSOL IDflTE/2 

CONST I TUT ION/2  CONSTRRINT/T 

CONSTRUCTINC/1 

CONSTRUCTION/1 

CONSUL TRNT/1 

CONSUL TRTION/1  CONSULTRTIONS/TCONTRIN/T 

COHTRINED/T 

CONTRINS/S 

CONTEnN/2 

CONTEXT/1 

CONTINCNT/2 

CONTINUOUS/I 

CONTRRDICT/2 

CONTRIVE/2 

COHTROL/T 

CONTRQLLEO/T 

CONVENTION/S 

CONVENTIONS/S 

CONVERGE /2 

convoi:e/2 

COOPERRTING/1 

COOPERRTION/l 

COOPERflTIVE/2 

COPULRTION/2 

COPY/T 

COPYING/! 

CORPORnTE/2 

CORRECTNESS/T 

CORRUGRTE/2 

cosnic/2 

COTTOH/2 

COULO/5 

COUHTERFEIT/2 

COUPLING/2 

COVER/2 

COVOTE/2 

CRRNIUn/2 

CRnvaN/2 

CREDO/2 

CRESCENT/2 

CRiniNOLOCY/2 

CROCODILE/2 

CROUCH/2 

CRUET/2 

CRVPTIC/2 

CUISINE/2 

CUHEIFORft/2 

CURlO/2 

CURRENT/S 

CURUnTURE/2 

CURVEO/1 

CUTICLE/2 

CYBERNETICS/1 

CYCLlC/1 

CYCLOTRON/2 

oncHn/2 

OflLLV/2 

DflNOLE/2 

DRNNY/S 

DRTfl/S 

ORTE/T 

ORTES/l 

DRVE/I 

DRVIO/1 

OOT/2 

DEB/2 

OEBRTE/T 

OECflLCOnnNIfl/2 

OECEnOER/S 

OECIPHER/2 

DECISlON/1 

DEC0nP0SITI0N/2DEDUCE/2 

OEDUCTION/T 

OEOUCriVE/1 

OCFERIIENr/2 

DmomE/2 

OELIBEHRTE/2 

DELVE/2 

0EtinHn/5 

DEnaCRRT/2 

DEN/2 

OENOTRTIONRL/1  DENTIST/2 

DEPDRTnEHT/2 

OEPTH/1 

DERELICT/2 

DERIVRTION/1 

DESCRIBE/T 

DESCRlPTION/1 

DESCRIPTIDNS/T 

OESECRnTION/2 

DESIGN/1 

DESIRE/5 

DESPERRTE/2 

DETnCH/2 

DETECTIOH/1 

DETRflIN/2 

DEVICES/5 

DEVITRLIZE/2 

DIRGNOSE/2 

DIRGNOSIS/1 

OIRLOGUE/l 

OIARRHOER/Z 

OICK/1 

OICTtm/2 

DIO/T 

OIDN'T/T 

DICIT/2 

DIt1/2 

DinENSIONRL/l 

DINKY/2 

DIRECTED/5 

OISRSSEnDLE/2 

DISCONBOBULRTE/ 

DISCUSS/T 

DISCUSSEO/T 

OISCUSSES/T 

DISCUSSING/5 

DISEUSE/2 

DISLOCRTE/2 

DISPLRY/l 

DISPDSRL/2 

DISREPUTE/2 

DISSOLUTE/2 

DISTRlCT/2 

DIVERSE/2 

OO/T 

DOCENT/2 

DOES/T 

DOESN’T/S 

DOFF/2 

DDLOR/2 

oonniN/1 

DOH'T/T 

DONRLD/l 

OOHE/5 

DONNYBROOK/2 

DOTOCE/2 

DOUC/l 

DOUnGER/2 

ORRFTEE/2 

DRRGON/1 

DRRGOHS/1 

DRRUINGS/1 

ORnML/2 

DREH/t 

OREVFUS/T 

ORINK/2 

ORIVING/1 

DROSS/2 

DUnL/2 

OUENNR/2 

oumiv/2 

OUPLICRTION/2 

OURING/S 

OYHRrflC/T 

DYHRniTE/2 

CflCH/5 

ERRL/1 

ERRLIEST/5 

EflRNEST/1 

ERRTH/2 

ECCENTRlC/2 

ED/T 

EDINBURGH/1 

EDITORlRL/2 

EFFICIENTLY/T 

EGOISTIC/2 

EICHT/1 

ElCHTEEN/5 

EICHTY/5 

ELflTE/2 

ELECTR0t1RGHCT/2ELECTR0NIC/l 

ELECTRaNlCS/1 

ELEVEN/5 

ELIOE/2 

ELLlOT/1 

ELUSIVC/2 

EnBE22LE/2 

EnENn/2 

EHOT ION/2 

Enu/2 

ENCOnPRSS/2 

EHEnV/2 

ENGLI5H/5 

ENNnBLE/2 

EHTERITIS/2 

ENTROPY/2 

ENVIRONNENT/l 

EOH/2 

EPIPHRNY/2 

EaunL12RT10N/2 

EOUIPnCNT/2 

ERIK/1 

ER(lflH/T 

ERHST/1 

ERRRHD/2 

ESCHEU/2 

ESPY/2 

ETHER/2 

EUGENE/1 

EUGEHICS/2 

EVRLUfiTE/2 

EVRLURTION/l 

EVRLURTOR/l 

EVENTS/1 

EVER/T 

EVERLHSTlNG/2 

EVERV/1 

EVERYTHING/T 

EXflnPLE/T 

EXflHPLES/l 

EXCISE/2 

EXCURSIVE/2 

EXHnUST/2 

EXIST/T 

EXORBITRNT/2 

EKPERT/1 

EXPERTISE/2 

EXPLRNRTION/1 

EXPLORER/2 

EXPRESSlONS/1 

EXPRESS)inH/2 

EXTIRPRTION/2 

EXTRnVRGnN2R/2 

EYESTRRIN/2 

FRBLES/1 

FRCES/l 

FHClNC/2 

FHCTS/5 

FHHLnnH/1 

FflILURE/2 

FRIRY/1 

FnL8ETT0/2 

FRNDnNCD/2 

FRRRRGO/2 

FRSTENER/2 

FflSTER/T 

FRTUOUS/2 

FEnTURE/2 

FERTURE-DRIVEH/FEBRUnRV/S 

FEOERflL/1 

FEEOBRCK/2 

FEIGENBRUn/T 

FELOnnH/T 

FELONY/2 

FERN/2 

FETCH/2 

FICHU/2 

FICTION/1 

FIFTEEM/5 

FIFTY/S 

FIGHT/2 

FIKES/1 

FILE/5 

FILL INC/2 

FlNO/2 

FINISH/S 

FINISHEO/S 

FIRST /S 

FIVE/5 

FIXnTION/2 

F0CEV/'2 

FONOU/2 

FOR/T 

FORBODE/2 

FORElGH/2 

FORESTS/1 

FORETOKEN/'E 

FORHRL/S 

F0RnnL12E/2 

FORItRTION/T 

FORTHRIGHT/2 

FORTV/5 

FOUND/E 

FOUR/S 

FOURTEEN/5 

FRRCRRNT/2 

FRflllE/B 

FRRHES/l 

FRflUDi'E 

FRENETIC/2 

FRIGHT/2 

FRon/r 

FROHTflCE/2 

FRUITIOH/2 

FU/l 

FUNCTION/S 

FUNCTIONS/1 

FUNDRIICNTRL/2 

FURRIER/2 

FUZZY/I 

CflOFESr/2 

6RINSAV/2 

eauivm/2 

CRHE/T 

CRHES/l 

CRHnER/2 

GRP/2 

GRRRI50N/2 

GRRY/l 

GRSCHNlG/l 

CnTHER/2 

GnZETTE/2 

GEHERLOGV/2 

GENERRL/l 

CENERRTE/T 

CENEROr ION/1 

CENUS/2 

GEOHETRIC/l 

GEORCE/1 

GERniNRTION/2 

GET/T 

GHOST/2 

6inCRRCC/2 

C IPS/1 

GIVE/T 

GIVEN/T 

GLRBRaUS/2 

GLHZE/2 

GLOBRL/2 

CLUE/2 

GM/1 

CO/T 

CO-nOKU/1 

GORL/l 

CORLS/1 

GORT/Z 

GOLDFlNCH/2 

GOPHER/2 

COUCE/2 

GRRDIENT/2 

GRRIN/S 

GRRnnRRS/l 

GRnnnnricRL/T 

GRRNOIOSE/2 

CRRPH/T 

CRflPHlCS/T 

GRRTIFICRTI0N/2GRfl2E/2 

GREU/2 

6RIP/2 

GROnMET/2 

GROUNDNRTER/2 

GRUNT/2 

GUILOER/2 

GUN/2 

GU2aE/2 

HnBlT/2 

HnOES/2 

HflLVCRS/2 

HRHBURG/l 

HflNDOUT/2 

HRHS/5 

HRPHn2RR0/2 

HRPPEN/1 

HflRLOT/2 

HRRRY/S 

HflS/T 

HRSN’T/S 

HflSP/2 

HRUTEUR/2 

HRVE/T 

HHVEH'T/5 

HflVES-ROTH/1 

HE/5 

HEROnnN/2 

HERRSRY/S 

HEflTH/2 

HEOONlSn/2 

HELICES/2 

HELP/T 

HELPmTE/2 

HEHCmmi/2 

HENORIX/1 

HER/S 

HERB/5 

HERBERT/1 

HEROStinN/2 

HERniT/2 

HETEROSTRTIC/l 

HEURISTIC/5 

HEM ITT/1 

HEUH/2 

HinOEH/2 

HILRRY/l 

HILL/5 

HILLOCK/2 

HIRSUTE/2 

HIS/1 

HISTORY/1 

HORK/2 

HOLINESS/2 

HOLLRND/T 

H0I1ILETIC/2 

HOHORRRIUn/2 

HOPC/2 

HORRlO/2 

HOSPITRLIZE/2 

HOUSEBORT/2 

HOH/T 

HUGH/l 

HUHnN/1 

HUHRNC/a 

HunnocK/2 

HUNOREO/5 

HUNCnV/1 

HUNT/1 

HURRRH/Z 

HYBRlD/2 

HYOROUS/2 

HYPHENflTIOH/Z 

HYPOTHESIS/1 

HYSTERlR/2 

1/T 

ro/T 

I’H/S 

lCY/2 

IOlOnflTIC/2 

IEEE/5 

lFIP/5 

ICHOniNIOUS/2 

IJCBI/T 

ILLINOIS/1 

ILLOGICRL/2 

IHRCE/l 

IHRGES/l 

inBRLRNCE/2 

innRTURE/2 

innOOEST/2 

inMUTflBLV/2 

inPRTIEHT/2 

inPERIRL/2 

inPIETY/2 

inPORT/2 

inPRRCTICRL/2 

inPROBRBLE/2 

inPROVINC/1 

inPUTRTION/2 

IN/T 

INnRTICULRTE/2 

INCflRHRDIHE/2 

INCISE/2 

INCOHtlODE/2 

I NCREO I B IL I TY/2 INCUnBENT/2 

INOEFINITE/2 

INDICflTE/2 

IH0ISPEHSnBLE/2IN0WCE/2 

WOUCTlVE/l 

mvsTnm/1 

INELICIBLE/2 

INEXRCT/1 

INFflLLIBLE/2 

INFERENCE/T 

INFERENCES/1 

INFERENTIRL/l 

INFEST/2 

INFORnRL/2 

INFORHATiaH/T 

INGRflIN/2 

INHERIT/2 

INHERITflNCE/1 

INJURE/2 

INNERnOST/2 

INPUT/2 

INSRNE/1 

INSECTIV0R0US/2INSIGHT/2 

INSPIRRTION/2 

INSTIL/2 

INSTITUTE/1 

INSULflTOR/2 

INTEGRnTE/2 

INTELLICEHCE/T  IHTELLIGEHT/1 

INTEHSITV/1 

INTENTlONS/1 

lNTERflCTlON/2 

INTERflCTIVE/1 

INTERESTEO/T 

lNTERJECTlON/2 

INTERHEOIRRY/Z 

INTERNUNCIO/2 

INTERPRETRBLE/lINTERPRETIVE/l  INTERRUPTS/1 

INTERTIOflL/2 

INTERVIEU/1 

INTONRTION/1 

INTR0SPECT10H/2INVRRIRBLE/2 

lNVRRlRNCE/1 

INVRRIRNCES/l 

INVESTHENT/l 

INVIOIOUS/2 

INVOCRTION/5 

INVOLUTE/2 

IRRSCIBLE/2 

IRRECOHCILRBLE/IRRESOLUTE/2 

IRV/1 

IS/T 

ISLRNn/2 

ISH'T/1 

ISOHERS/l 

ISSRC/5 

ISSUE/5 

ISSUEO/S 

lSSUES/5 

IT/T 

ITRLIflN/2 

lTERRTION/1 

ITS/5 

JRCK/1 

JRCKLEG/2 

JRHES/1 

JRNURRY/S 

JRR/2 

JERN/l 

JEFFREY/1 

JEJUHE/2 

JERRY/T 

JEUEL/2 

JITTERS/2 

JOHM/T 

J0LLIFICflTI0N/2J0SEPH/l 

JOORNRL/T 

JOURNBLS/T 

JOY/2 

JUOER/1 

JUOICIRL/1 

JUICY/2 

JULY/5 

JUNE/S 

JUPITCR/2 

KRFFEEKLRTSCH/EKRRL/l 

KEBOB/2 

KEITH/1 

KCH/T 

KEY/5 

KEYS/5 

K18BUTZ/2 

KILL/G 

KINO/1 

KINDLE/2 

KINDS/5 

KING/1 

KITCHENUflRE/2 

KNICKNRCK/2 

KNOU/5 

KNOULEOCE/S 

KNONN/1 

KORERN/2 

KUGEL/l 

LRDOR/2 

LRRS/1 

LnCNlRPPE/2 

LRnBDR/l 

LRI1INR/2 

LRNOSnnN/2 

LRNGURGE/S 

LRNGURCES/5  . 

LRPPET/2 

LRRCE/1 

LRSH/2 

LRST/T 

LRTELY/T 

LRTEST/S 

LRTRINE/2 

LRURENT/1 

LRVISH/2 

LER/2 

LERRNED/2 

LERRNIHC/T 

LECTURES/1 

LEE/T 

LEER/2 

LECEROENRIN/Z 

LEK/2 

LENRT/1 

LEONRRD/T 

LEPRECHRUH/2 

LES/1 

LESSER/1 

LET/S 

LET'S/5 

LETTOCE/2 

LEXICOHETRY/l 

LEXICON/2 

LIBlOINOUS/2 

LIE/2 

LIGHT/l 

LICHTFRCE/2 

LIKE/T 

LinBO/2 

LiniT/S 

LinlTEO/1 

LINOR/1 

LINE/1 

LINERR/1 

LINEN/2 

LINGUISTICS/T 

LINTEL/2 

LISP/1 

LIST/T 

LISTEO/S 

LlSTEN/2 

LISTING/S 

LITHOCRRPNV/2 

LIVELIHOOO/2 

LORn/2 

LOCflTE/2 

LOCRTION/1 

LOCRTIONS/1 

LOGlC/1 

LOCICRL/1 

LOHG/T 

LONGlNC/2 

LORCHETTE/Z 

LOSING/1 

LOVC/2 

LOU/1 

LUCID/2 

LUnBER/2 

LURIO/2 

LYRE/2 

HRCHINE/l 

nRCIIlNES/1 

nRCt;iNRU/2 

nncRO/1 

nnoCLINE/l 

nncNnn/2 

nRCflZINES/1 

nRCNETIC/2 

nRHOCRNY/Z 

nRJORITY/2 

nni:E/s 

nRLEFflCT ION/2 

nflLLCT/2 

nflNRGCflRLE/2 

nnNRGEnr.NT/1 

nflNCnNESE/2 

nnHIFOLO/2 

nRNIPULRTING/1 

HRNIPULRTORS/T  HRNNR/l 

nnNSLflUGHTER/2  HRNTRR/l 

nnNv/T 

nnpLE/2 

nRPPING/l 

HRRCH/S  - 

nnRINRTE/2 

RRRKET/l 

nnRauETRV/2 

HRRR/l 

HRRSLRNO/l 

nflRTELLI/1 

nORTYR/Z 

nHRWIN/T 

nnRV/i 

HRSINTER/l 

nflSSnCHUSETTS/inRSSEUR/2 

nflT/2 

nflTCHING/T 

nnTRIRRCHnL<'2 

nRUL/2 

noY/s 

ncCRRTHY/l 

nCCORDUCK/l 

nCOERHOTT/l 

nE/T 

nERNING/l 

nERHS/l 

nERSURE/2 

nEOIRTION/2 

HEDICRL/l 

nEEK/2 

REETINC/S 

nEETINCS/S 

nELODIC/2 

HELTZER/l 

HEHORIES/T 

nEnoRv/1 

nENnGE/2 

nENTION/T 

HENTIONEO/T 

nENTIONING/5 

HENTIONS/T 

RENTOR/E 

nENU/5 

HENUS/T 

nERlNCUE/2 

nESSIRH/2 

HETR-SYnBOLIC/l 

nETRnRTHEnRTICSnETRPHORICRL/2 

HETHODS/l 

nETRE/2 

niCHflEL/5 

niCHRLSKI/l 

HlCHJE/l 

HlCROnETER/2 

niDSUNnER/2 

niKE/1 

niLIEU/2 

niLLINC/2 

niNERflL/2 

niNIHRL/l 

niHEER/1 

niNORlTY/2 

niNSKY/T 

niRTH/2 

niSCELLRHV/2 

niSFILE/2 

niSRULE/2 

HlSTRlRL/2 

niTCHELL/1 

HLISP/l 

nLlSP2/l 

nOOILE/2 

nODEL/1 

noOELING/l 

noocLs/1 

nOOCRNLY/2 

no I ST/2 

nOLLlFY/2 

noNnsTicisn/2 

nONITOR/T 

nONKEY/1 

nONOTONE/2 

nOHTH/S 

nONTNS/5 

nOOR/2 

nORRTORIUn/2 

nORE/T 

noRPHcnic/2 

nORTIFICRTION/2naST/5 

nOSTOH/I 

nOTILITY/2 

nOTION/l 

nOURN/2 

novE/i 

nOVEnENTS/1 

noviES/i 

nUCK/2 

nULRTTO/2 

nULTILEVEL/1 

nULTIPLE/2 

nULTIPROCESS/1 

nUROER/2 

nusic/T 

nUSKETEER/2 

nUST/T 

nUTRTIVE/2 

nYNRH/E 

HYSELF/S 

N/2 

NflCEL/1 

NRP/2 

NRRRnTOR/2 

NflSH-UEBBER/l 

NRTIONRL/l 

NRTIONRLIZRTIONNRTURRL/S 

NRVRL/Z 

NERTH/E 

NECROSIS/2 

NEGOTIRNT/2 

NEOPLRSn/2 

METS/5 

NETTING/2 

NETUORC/T 

NETUORKS/T 

NEURRL/T 

NEVER/2 

NEU/T 

NEUBORN/l 

NEUCOnER/l 

NEHELL/T 

NENEST/S 

NEUEY/1 

NEUSLETTER/5 

NEKT/5 

MlCETY/2 

NIH/1 

NILS/T 

NILSSON/T 

NINE/1 

NINETEEN/S 

NINETY/S 

NITRRTE/2 

NO/T 

NODDV/2 

NOniNRTING/l 

NOniNRTION/l 

NoniNEES/1 

NON-INOEPENOENTNOUnGENRRIRN/2 

NOUDETERn IN 1ST INONPRRT ISRN/2 

NOR/2 

MORI/1 

NORMRN/l 

NOSrRIL/2 

NOT/5 

NOTES/l 

NOTORIETY/2 

NOVEnOER/S 

NOURY/2 

NRL/1 

NUGCET/2 

NUN/2 

0/2 

OBEY/2 

OBJECT/1 

OBJECTS/1 

OBLITERRTE/2 

OBSTflCLE/2 

OCCLUOE/2 

OCTOBER/5 

OCTOPUS/2 

OF/T 

OFFENSIVE/2 

OHLRNOER/1 

OHn/2 

OK/l 

OLOEST/T 

OnELETTE/2 

ON/T 

ON-LINE/1 

ONE/T 

OKES/1 

ONLY/T 

ONSLnUCHT/2 

ONTOGENY/1 

OPERR/2 

OPERRTIONRL/1 

0PP051TI0N/2 

OPTIHRL/l 

0PTini2E0/l 

OR/T 

ORRNGUTnH/2 

ORDER/1 

OROERS/1 

OROINRL/2 

0RCnN12flTI0H/l 

ORGV/2 

ORIENTEO/1 

ORNERy/2 

OTTOnnH/2 

OUR/S 

OURSELVES/5 

OOTCRV/2 

OUTLflNaiSH/2 

OUTRUN/2 

OVRTION/2 

0VERE)(PaSE/2 

OVEHLRYS/T 

OVERPRSS/2 

OVERTURN/2 

0XI0I2E/2 

pnc»:/2 

PRCLET/l 

PRGODR/2 

PflIR/1 

PHLRTINE/2 

PflLLIRTION/2 

PRnELR/l 

PRnPR/2 

PRHEL/2 

PRNTOninE/2 

PflPER/T 

PRPERS/T 

PflPERT/1 

PflPRIFR/E 

PRRRLLEL/2 

PRRRLLELISn/1 

PRRRNOIR/1 

PRRflPHRRSE/1 

PRRRTYpHOIO/2 

PRRIRH/E 

PnROLE/2 

PRRRY/l 

PRRTHENOGEHESISPRRTIRL/l 

PflS/2 

PRSCRL/l 

prssi:ey/2 

PRSTURE/2 

PRT/1 

PRTHFINOER/1 

PRTIENCE/2 

PRTTERH/5 

PRYnENT/2 

PERRL/1 

PECCflRV/2 

PEOICRB/Z 

PELRGE/2 

PENDnNT/2 

PENNIES/2 

PENULTinRTE/2 

PERCENT/2 

PERCEPTlON/1 

PERCEPTRONS/5 

PERFECTIOH/2 

PERFORIIRHCC/1 

PERIPNRRSIS/2 

PERniS5IBLE/2 

PERRY/1 

PERSECUTE/2 

PERSPECTIVE/2 

PERVERT/2 

PETER/1 

PETITE /2 

PHRLflNGCS/2 

PHRSE/2 

PHILOSOPIIER/2 

PIIOSPIIRTE/2 

PHOTOCRRnnETRY/PHOTOSYNTHESIS/PHRRSE/T 

PHRR5ES/5 

PHYSICIRNS/5 

PIRHIST/2 

PICKLOCK/2 

PICTURE/T 

PlECE/1 

PIFFLE/2 

PILL/2 

PINCUSHION/2 

PINGLE/l 

PIONEER/2 

PlTUITRRY/2 

PLflCIRRIZE/2 

PLRNES/1 

PLRNNER-L IKE/S 

PLRNNING/1 

PLRNS/1 

PLRNTRTION/2 

PLflTYPU8/2 

PLRYING/T 

PLERSE/T 

PLEBS/2 

PLOD/2 

PLUnY/2 

PMEUnnTIC/2 

POlNTEO/2 

POKER/1 

POLICV/2 

POLLOCK/2 

POLYHEDRR/1 

POLYP/2 

PONCHO/2 

POPE/2 

PORCUPINE/2 

PORTEHTOUS/2 

POSlTlVE/2 

POSTERlOR/2 

POSTHRR/2 

POTPOURRI/2 

POK/2 

PRflSEOOYniUn/2  PRECRRIOUS/2 

PRECISION/Z 

PREOlCRnENT/2 

PRCDICRTE/5 

PREFERRBLV/2 

PREFERENTIRL/T  PREPONOCRRTE/2 

PRESEHT/T 

PRESENT inENT/2 

PRESUnPTIOH/2 

PREVnRICflTE/2 

PRICE/1 

PRlCE'S/1 

PRjnRRV/2 

PRiniTIVES/1 

PRIHT/S 

PRIHTEO/5 

PRINTING/S 

PRIOR/2 

PRIVY/2 

PROBLEH/r 

PROBLEHS/T 

PROCEOURRL/l 

PROCEOURES/T 

PROCEEDING/5 

PROCEEDINGS/S 

PROCESSES/1 

PROCESSINC/1 

PROCESSlON/2 

PRODUCE/5 

PROOUCCD/5 

PROOUCTION/T 

PRODUCTlVlTV/1 

PROEn/2 

PROGENY/2 

PROCRRH/T 

PROGRRHHING/T 

PROGRRHS/T 

PROGRESS/1 

PROLIX/2 

PROnULGnTE/2 

PROOF/T 

PROOFS/T 

PRDPERTIEO/2 

PROPERTIES/1 

PROPULSIOH/2 

PROSPERnUS/2 

PROTESTRTION/2  PROTOCOL/1 

PROTOCOLS/I 

PROVEHIEHCE/2 

PROVER/1 

PROVINC/T 

PROXiniTr/2 

PSYCHIRTRI8T/2 

PSVCNOLOCV/S 

PUDLlSH/5 

PUBLISHEO/T 

PUDLISHER/S 

PUDLISHERS/S 

POCK/2 

PULE/2 

PUHPERNICREL/E 

PUNT/2 

PURIFIER/2 

PURPOSE/1 

PURSUE/2 

PUTNnn/1 

PUTRlOITY/2 

QURORUPLE/2 

QURNTUn/2 

QUEEN/2 

OUERIES/T 

QUESTION/I 

QUIETUS/2 

QUIT/5 

QUIXOTIC/2 

OUOTE/T 

OUOTED/T 

RRBBI.E/2 

RROIRTE/2 

RROIO/1 

RRCE/2 

RRJ/5 

RRKISH/2 

RRLSTOH/l 

RRNRN/1 

RRNCIOITY/2 

RRPHRCL/l 

RRPIDLV/2 

RRSPBERRY/2 

RflTTnM/2 

RRY/2 

RRVHOND/l 

RERL-UORLD/l 

RERLISn/2 

RERSONINGA 

REBEL/2 

RECEIPT/2 

RECENT/T 

RECENTLY/S 

RCCITE/2 

RECOCNinONA 

RECONNOlTER/2 

RECRUIT/2 

REODEN/2 

REOOY/T 

REOUCTION/5 

REED/1 

REEVE/2 

REFER/T 

REFERENCE/5 

REFERENCEO/S 

REFERENCES/5 

REFERREO/T 

REFERRlNC/r 

REFRnCTORV/2 

REGRROING/T 

RECENCV/2 

REGlON/l 

RECRESS/2 

REGULRRLV/l 

REINCRRHRTI0N/2REITER/1  ' 

RELflTE/1 

RELRTEO/T 

RELRTES/1 

RELRTlONRL/l 

RELERSED/5 

RELUCTflNCE/2 

REnONSTRRTE/2 

<<ENECRDE/2 

REPERL /2 

REPORT/5 

REPORTER/1 

REPORTERS/1 

REPORTS/S 

REPOSE /2 

REPRESENTATION/ 

REPRESENTING/1  REPUDIRTE/2 

REOUEST/S 

RESCINO/2 

RESERRCH/1 

RESICNRTION/2 

RESOLUTION/T 

RESOURCE/1 

RESPITE/2 

RESPONSES/T 

RESTRICT/S 

RESTRICTlON/2 

RETIRE/2 

RETRIEVRL/S 

RETRIEVE/5 

REVEflL/2 

REVIEUS/S 

REVITALIZATION/ 

RNESUS/2 

RHOHBERG/l 

RlAL/2 

RICH/1 

RICHRRO/l 

RICK/1 

RIOICULaUS/2 

RIEGER/l 

RIESBECK/1 

Rin/2 

R’SEHRN/l 

RISEN/2 

RlVULET/2 

ROBERT/1 

ROBOT/T 

ROBOTIC/T 

ROBOTICS/1 

ROBOTS/1 

ROCHESTER/1 

ROO/2 

ROGER/l 

ROHflNTIC/2 

ROH/1 

ROSEHRRY/Z 

ROSENFELO/T 

ROTUNDfl/2 

ROUTE/2 

RUBIN/1 

RUBY/2 

RULE/1 

RULES/1 

RUnnHIRN/2 

RUMELHRRT/l 

RUSSET/2 

RUTGERS/1 

RYCHENER/1 

S-L-GRRPHS/1 

SnBOTRGE/2 

SRCEROOTI/1 

SRDDLE/2 

SRILOR/2 

SRLOON/2 

SRMHEr/I 

SflNCriTY/Z 

smemui 

SANK/2 

SAT/2 

SATISFRCTION/T 

SRTURDRV/2 

SflVOR/2 

SRY/l 

SCnNDINflVIRN/2  SCENE/1 

SCCNT/2 

SCHRNL/l 

SCIENCE/T 

SCIENTIST/2 

SCOTCH/2 

SCOTT/1 

SCREEN/2 

SCUFFLE/2 

SEnn/2 

SERRCH/l 

SECRET/2  . 

SEORTIVE/2 

SEEA  • 

SEEK/5 

SEEKING /I 

SEGHENTRT ION/1  SECRECRTION/2 

SELECTA 

SELTZER/1 

SEHRNTIC/T 

SEHRNTICS/T 

SEnBLnNCE/2 

SENO/2 

SENSEA 

SENSURLLV/2 

SENTENCE/1 

SENTENCES/T 

SEPTEHDER/S 

SEPTURCiNT/2 

SERF/2 

SERIRL/1 

SESflnE/2 

SESSION/1 

SESSIONS/S 

SEVEN/l 

SEVENrECN/1 

SEVENTV/5 

SEVERflL/1 

SEVnOUR/l 

SHnLL/2 

SHOPC/l 

SHRN/l 

SHRNL/Z 

SHE/1 

SHELLFISH/2 

SHINCLE/2 

SHOEHRKER/2 

SHOOTING/1 

SHORTLIFFE/1 

SHOULD/T 

SHOU/5 

SHOIIER/2 

SICKLE/2 

SlGflRT/5 

SIGNBOARO/2 

SIICLOSSV/1 

SILLV/2 

SinON/5 

SinPLIFY/2 

SlHULRTION/l 

SinULTRNEOUS/1 

S I nUL  TRNEOUSLV/S INCE  /S 

SIREN/2 

SIX/l 

SIXTEEN/S 

SIXTY/S 

SIZE/5 

SKIFF /2 

SLRCK/2 

SLflCLE/1 

SLflVEA/2 

SLEH/2 

SLITHER/2 

SLOH/T 

SLUICE/2 

snc/1 

SniLRX/2 

SniTH/1 

snuG/2 

SNRRING/1 

SNIOE/2 

SHUFFLE/2 

SO/S 

SOBEL/1 

SOCIRLIZE/2 

SOFTURRE/l 

SOLnRIUn/2 

SOLIOLV/2 

SOLOHRY/1 

SOLUTlONS/5 

S0LVIN6/T 

SOHE/T 

SOHETHING/S 

SOHEUHERE/l 

SONRTR/2 

SORCERY/2 

SORT/5 

SORTS/T 

SOURCCS/1 

SPACE /5 

SPRDE/2 

SPRNNINC/1 

SPnTE/2 

SPECI0U8/2 

SPEECH/B 

SPEED/T 

SPENT/2 

SPIRRL/2 

SPLENOIO/2 

SPONTRHEITY/2 

SPRRUL/2 

SPROULL/1 

SPUNC/2 

SRl/1 

STRC/2 

STnLHflRT/2 

STRNFORO/1 

STAPHYLOCOCCUS/STATE/l 

STATISTICIAN/2 

STERn/2 

STEPBROTHER/2 

STEREO/1 

STEVE/1 

STEVEOORE/2 

STinULUS/2 

STOCHRSTIC/l 

STOCK/1 

STGCKPlLE/2 

STOOL/2 

STOP/5 

STORAGE/1 

STORED/5 

STORlES/5 

STORY/S 

STRRIH/2 

STRERn/2 

STRlOINC/2 

STRONC/2 

STRUCTURE/1 

STRUCTUREO/1 

STRUCTURES/1 

STUBBORN/2 

STUOIES/1 

STURGEOH/2 

SUBDUC/2 

8UBJECT/6 

SUBJECTS/6 

SUBPROBLEOB/l 

SUB6ELECT/6 

SUBSTflHTm/2 

SUBSYSTEH/l 

SUDVERT/2 

SUET/2 

SUITRBLE/2 

SUHEX/l 

sunnnRiES/5 

sunnnRi2E/2 

sunnnRY/s 

SUHC/1 

SUNSNINE/1 

SUPERRBUNDRNT/2 

SUPERUISE/2 

SUPPOSEO/2 

SURE/I 

SURCERy/2 

SURNOTES/5 

SURVEY/1 

SURVEVING/2 

SURVEVS/5 

SUSSEX/1 

SU2UCI/1 

SYFES/1 

SYLLRBICATION/Z 

SVnBOL/1 

SYnPHnMY/2 

SYNCHRaHIZRTIONSYNOHVn/2 

SYNTRCTIC/1 

SYNTflX/S 

SYNTHESIS/T 

SYNTHESIZER/1 

SYSTEH/T 

SYSTEHS/T 

TRBLE/2 

THCK/2 

TBKE/T 

TflLEHT/2 

TRLES/l 

TRnP/2 

TRNTRLUn/2 

TflRGET/2 

TflSr./'l 

TRTTIHC/2 

TRXOHOnV/2 

TECH-II/1 

TECHNICRL/1 

TECHHICIRN/2 

TECHNIOUES/1 

TECHNOLOGY/l 

TEO/1 

TELEGRRR/Z 

TELEOLOGICRL/1  TELL/5 

TELLTflLE/2 

TERPORnt/l 

TEHPTRTIOH/2 

TEN/5 

TENET/2 

TERniNRL/l 

TERtllNALS/l 

TERRINRTE/S 

TERniNRTION/l 

TERRIER/2 

TERRY/T 

TESTIFY/2 

TEXT/T 

TEXTURE/l 

THRHC/S 

THRHCS/l 

THRT/T 

THRTCH/2 

THnunnTURCIST/lTHE/T 

THEIR/5 

THEH/T 

THEOREH/T 

THEORETICRL/2 

THEORV/T 

THERE/T 

THERIIOnETER/2 

THESE/T 

THEY/5 

THIEVES/2 

THIRTEEH/5 

THIRTY/l 

THIS/T 

THonns/1 

THORN/2 

THORNDYXE/1 

THOSE/'T 

THOUGHT/! 

THREE/1 

THREH/2 

THROUCH/5 

THRUn/2 

TIBETflN/2 

TlCHT/2 

TILL/1 

TIHE/S 

TINES/l 

TinPRNI/2 

TIPPET/2 

TITLE/5 

TITLEO/2 

TITLES/T 

TO/T 

TOENRIL/2 

TonnHnux/2 

TONIGHT/2 

TOPER/2 

TOPIC/5 

TOPICS/T 

TOPOLOGY/1 

TOROTH/2 

TOTflU/2 

TOUT/2 

TRRCE/2 

TRRCEDY/2 

TRRNSRCTION/5 

TRflN.SRCriOKS/5 

TRflHSflrLflHriC/2TRRHSFER/l 

TRRNSITION/T 

TRRNSITIVE/2 

TRRNSnir/5 

TRRNSniTTING/S  TRRNaPORT/2 

TREflT/2 

TREES/1 

TREPIORTlON/2 

TRICHINRE/2 

TRILOCY/2 

TRlSECT/2 

TROLL/2 

TROUBLE/1 

TROU/2 

TRUNCHEOH/2 

TRV/5 

TUBIHG/2 

TUHOR/2 

TURBOT/2 

TURRET/2 

TUTOR/1 

TUTORIRL/l 

TUTORIHG/1 

TV/1 

THELVE/1 

TUENTY/1 

TUILIGHT/2 

TUO/5 

TYPES/1 

TYPHOON/2 

U.S./l 

UHR/T 

ULLNON/l 

ULTRRnflRINE/2 

UNflLLOYEO/2 

UNCERTRIN/2 

UNCURL/2 

UNDERESTinflTE/ZUNDERPROOUCTION 

UN0ER5TflN01NC/5UN0£RUnrEfl/2 

UNERRlNC/2 

UNHRRNES5/2 

UHlFORtl/T 

UNITHRIRH/2 

UHIVERSRLS/1 

UNniSTRKRBLE/2  UHRRVEL/2 

UP/T 

UPHERVRL/2 

UPROnR/2 

URRNIUn/2 

URINnLYSlS/2 

US/5 

USE/T 

USING/l 

USSR/1 

USURLLV/l 

USURPRTION/2 

VRCflNT/2 

VRIN/2 

VRLIORTION/Z 

VRHROIUn/2 

VRQUERO/2 

VRRIETY/1 

VflRSlTY/2 

VEERY/2 

VELOCITV/2 

VENERERL/2 

VERRCIOUS/2 

V£RincmOM/S  VERILY/2 

VERSICLE/2 

VERVE/2 

VET/2 

VIBRflTIOH/2 

VIC/1 

VICTini2flT10N/2VIEUS/l 

VICIL/2 

VILLUS/2 

VIOLflTION/2 

VIRTURL/2 

VISCUS/2 

VISION/T 

VISURL/1 

VIVIFIER/2 

VOCUE/2 

VOLUDLV/2 

VOLUIICS/5 

VOTIVE/2 

VULNERRB IL I TY/2URGER/2 

URITRESS/2 

URLDIHCER/l 

URLLY/l 

URIlPUn/Z 

URNT/S 

URREROOn/2 

UflRY/2 

HRS/T 

MflSN'T/1 

HflTERPOMCR/2 

UHTSON/t 

HflVEFORHS/T 

UE/5 

UE’D/S 

UE'RE/l 

UE’VE/1 

UERLVl 

ueri:fish/2 

UERTHERUaRH/2 

UElRn/2 

UEIZENBRUn/l 

HERE/5 

UEREH’T/1 

HHRCi:/2 

HHRT/T 

MHRT*S/1 

UHEN/T 

NHERE/T 

UHERERS/2 

uhich/t 

UHIFFLETREE/2 

UHIPPOORHlLL/2  HHITEFISH/2 

NHO/T 

UHOLLY/2 

NHY/S 

uiaaH/2 

HILFtfL/2 

HILCS/l 

MILL/T 

HIHOFRIUZ 

MINOCRRO/r 

UtNOGRflD'8/6 

UINSTON/1 

UINTERGREEN/2 

UISH/5 

UITH/1 

UITHE/2 

MOflO/2 

UOODS/T 

UOODY/1 

UOOL/2 

UORO/1 

UOROS/1 

UORK/5 

UORLD/1 

UORH/2 

UOULD/1 

URRNGLE/2 

urini;le/2 

URlTE/1 

URITING/T 

URITTEN/T 

NROTE/T 

XENlR/2 

YRnnER/2 

YERR/T 

YERRH/2 

YERRS/5 

YES/1 

YON/2 

YORICX/1 

YOU/T 

YUnnY/2 

ZINC/2 

ZOHRR/1 

ZOOlO/2 

ZUCXER/l 

Appendix  C:  Schwa  Deletion  Rules 

The following two tables summarize the schwa deletion rules used by Noah.
Words from the 20,000-word dictionary were grouped according to the context about the
schwa (stress of syllables, location of syllable boundaries, and preceding and following
phoneme). A rule is based on the very subjective test of whether the words of a group
"sound right" for carefully articulated speech when the schwa is deleted. This resulted in
a schwa being deleted when it appears 1) as a one-phoneme syllable (the first table) and
2) with one onset (the second table), for the left and right phoneme contexts given by the
tables. (The right phoneme context appears at the top: L, M, N, and R; the left phoneme
context appears at the left margin.) For both cases, the syllable preceding the syllable
with the schwa must be stressed (indicated by a "1" before the syllable); the syllable
following the syllable with the schwa must be unstressed (no number before the syllable).

One of three conditions is indicated for each position in the tables: 1) (no
samples) -- no words were found with a schwa in the indicated context; 2) an underlined
word with a pronunciation -- an example of a word for which the schwa is deleted by rule;
and 3) a word which is not underlined, with a pronunciation -- an example of a word for
the indicated context in which the schwa is not deleted. (Of course, there are many
contexts not given by the tables for schwas which were not deleted.) Thus, rules are
indicated in the tables by underlined words.
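
The conditions for the first table (a schwa standing alone as a syllable, a stressed syllable before it, an unstressed syllable after it, and a matching left/right phoneme context) can be sketched in Python as follows. This is only an illustration: the `Syllable` type, the function name, and the `EXAMPLE_CONTEXTS` set are assumptions introduced here, and the single context listed is a placeholder, not a claim about which table cells are underlined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Syllable:
    phonemes: List[str]
    stressed: bool = False  # True when the syllable carries the "1" mark


# Hypothetical (left, right) phoneme contexts; placeholders only, NOT the
# rule set of the thesis (the underlining that marks rules is not encoded here).
EXAMPLE_CONTEXTS = {("M", "L")}


def delete_lone_schwas(syllables, contexts=EXAMPLE_CONTEXTS):
    """Drop one-phoneme AX syllables that satisfy the stated conditions."""
    kept = []
    for i, syl in enumerate(syllables):
        is_lone_schwa = syl.phonemes == ["AX"]
        has_neighbors = 0 < i < len(syllables) - 1
        if is_lone_schwa and has_neighbors:
            prev, nxt = syllables[i - 1], syllables[i + 1]
            left, right = prev.phonemes[-1], nxt.phonemes[0]
            if prev.stressed and not nxt.stressed and (left, right) in contexts:
                continue  # schwa deleted by rule
        kept.append(syl)
    return kept


# FAMILY, written as 1F AE M -AX-L IY in the table notation
family = [Syllable(["F", "AE", "M"], True), Syllable(["AX"]), Syllable(["L", "IY"])]
print([s.phonemes for s in delete_lone_schwas(family)])
```

With the placeholder context set, the middle syllable of the FAMILY example is removed, while a word whose (left, right) pair is not in the set is returned unchanged.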


Schwas Appearing as -AX-


RIGHT:  L    M    N    R

LEFT:

B 

PHROBOLR 

P RX-IR  RE  B -RX-L  RX 

(NO  SRHPLES) 

CABINET 

IK  RE  B -RX-N  RX  T 

BfiflBEBI 

IR  RR  B -RX-R  lY 

OH 

(NO  SRHPLES) 

(NO  SRHPLES) 

(NO  SRHPLES) 

BSEIilBEll 

IB  R EH  OH-RX-R  RX  N 

D 

PUPOLIHG 

IP  RX  0 -RX-L  IH  NX 

ROflHflHT 

IRE  0 -RX-H  RX  N T 

CRROIHRL 

IK  RR  R 0 -RX-N  EL 

eeqebbl 

IF  EH  0 -RX-R  EL 

F 

SYPHILIS 

IS  IN  F -RX-L  RX  S 

(NO  SRHPLES) 

oefinite 

10  EH  F -RX-N  RX  T 

REFEBEUCE 

IR  EH  F -RX-R  RX  N S 

C 

(UCCUlUl 

IN  IN  C -RX-L  IH  NX 

BIGRHV 

IB  IH  G -RX-H  lY 

RCONY 

IRE  G -RX-N  lY 

lU  RE  G -RX-R  lY 

K 

CHOCOLRTE 

IT  SH  RR  X -RX-L  RX  T 

HEOICRHENT 

..10  IH  K-RX-H  RX  N T 

(NO  SRHPLES) 

HICKORY 

IHH  IH  K -RX-R  lY 

n 

FRHILV 

IF  RE  n -RX-L  ir 

(NO  SRHPLES) 

(NO  SRHPLES) 

summsi 

IS  RX  N -AX-R  lY 

N 

FINRLLV 

IF  RY  N -RX-L  lY 

niNlHRL 

in  IH  N -RX-H  EL 

(NO  SRHPLES) 

S££im 

IS  lY  N -RX-R  lY 

p 

HRPPLIY 

IHH  RE  P -RX-L  lY 

(NO  SRHPLES) 

oceuhil 

lOU  P -AX-N  IH  NX 

SLIPPERY 

IS  L IH  P -AX-R  lY 

R 

BflBBUSl 

in  ON  R -RX-L  RX  8 T 

crrrhel 

IK  RE  R -RX-H  EL 

DOBUiU 

in  RE  R -RX-N  ER 

(NO  SRHPLES) 

SH 

gflCHELOB 

IB  RE  T SH-RX-L  ER 

(HO  SRHPLES) 

HRTIONRL 

IN  RE  SH-RX-N  EL 

(uom 

IN  RE  T SH-RX-R  EL 

S 

Q£S1MIL 

10  EH  S -RX-L  RX  T 

SPLCIBEB 

IS  P EH  S -RX-H  RX  N 

LflKCEHi 

IL  RR  R S -RX-N  lY 

CURSORY 

IK  ER  S -RX-R  lY 

TH 

CHTHflLIt 

IK  RE  TH-RX-L  IH  K 

RHRTHEHR 

RX-IN  RE  TH-RX-H  RX 

(NO  SRHPLES) 

PLETHORB 

IP  L EH  TH-RX-R  RX 

V 

JRVFLIH 

10  ZH  RE  V -RX-L  RX  N 

(NO  SRHPLES) 

RVEHUE 

IRE  V -RX-N  Y UU 

BRBYERY 

IB  R EY  V -RX-R  lY 

ZH 

numna 

IN  RR  D ZN-RX-L  ER 

RFCIHENT  BEBinmiL 

IR  EH  0 ZH-RX-n  RX  N T IR  IH  0 ZN-RX-N  EL 

SUBCEBI 

IS  ER  D ZH-RX-R  lY 

Z 

HERSLY 

in  lY  Z -RX-L  lY 

flzinuTH 

IRE  Z -RX-H  RX  TH 

PILSNER  • 

IP  IH  L Z -RX-N  ER 

niSERY 

in  IH  Z -AX-R  lY 


Schwas Appearing as -(onset phoneme) AX-

When a schwa is deleted for the phoneme context given below, the left phoneme
(i.e., the onset phoneme) is merged with the following or preceding syllable by the rule: if
a legal onset (as defined by the onset lexicon given in Appendix B) is formed when the
phoneme is appended to the onset of the following syllable, the new onset is used in
the pronunciation; otherwise, if a legal coda (as defined by the coda lexicon) is formed
when the phoneme is appended to the coda of the preceding syllable, the new coda is
used in the pronunciation; otherwise the schwa is not deleted. There is one exception to
this: if the left phoneme is a "Y", it is deleted with the schwa. Again, the context for
which a schwa is deleted is indicated by an underlined word.
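
The merge rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Syl` type and the function name are introduced here, the two lexicons are tiny hypothetical stand-ins for the onset lexicon of Appendix B and the coda lexicon, and the stranded phoneme is assumed to be placed in front of the following onset.

```python
from typing import NamedTuple, Optional, Tuple


class Syl(NamedTuple):
    onset: Tuple[str, ...]
    nucleus: str
    coda: Tuple[str, ...]


# Hypothetical stand-ins for the onset and coda lexicons (assumed contents).
LEGAL_ONSETS = {("K", "R"), ("B", "R")}
LEGAL_CODAS = {("R", "G")}


def merge_after_deletion(prev: Syl, schwa: Syl, nxt: Syl,
                         onsets=LEGAL_ONSETS,
                         codas=LEGAL_CODAS) -> Optional[Tuple[Syl, Syl]]:
    """Return (prev, nxt) with the schwa syllable removed, or None if it stays."""
    (p,) = schwa.onset              # the schwa syllable has exactly one onset phoneme
    if p == "Y":                    # exception: "Y" is deleted with the schwa
        return prev, nxt
    new_onset = (p,) + nxt.onset    # try the following syllable's onset first
    if new_onset in onsets:
        return prev, nxt._replace(onset=new_onset)
    new_coda = prev.coda + (p,)     # otherwise try the preceding syllable's coda
    if new_coda in codas:
        return prev._replace(coda=new_coda), nxt
    return None                     # neither is legal: the schwa is not deleted


# BAKERY, written as 1B EY-K AX-R IY in the table notation
prev, nxt = merge_after_deletion(Syl(("B",), "EY", ()),
                                 Syl(("K",), "AX", ()),
                                 Syl(("R",), "IY", ()))
print(prev, nxt)
```

The order of the tests mirrors the text: the onset of the following syllable is tried first, the coda of the preceding syllable second, and deletion is refused when neither yields a legal cluster.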


RIGHT:  L    M    N    R

LEFT:

B 

JUDILRHT 

(HO  SRNPLES) 

COHCUBINRGE 

i 

NEIGHBORING 

10  ZH  uu-B  nx-L  nx  h 

T 

..lUU-B  RX-N  IH  0 ZH 

IN  EY-B  RX-R  IH  NX 

0 

INOOLEHT 

RBOOMEN 

NOLYBDENUN 

BOUNDRRY  . 

IIH  N -0  nx-L  nx  N T 

IRE  B -D  RX-N  RX  N 

..IL  IH  B -0  RX-H  ER 

IB  RN  N -0  RX-R  lY  i 

F 

(NO  snnPLES) 

INFRNOUS 

SYNPHONY 

OFFERING 

IIH  N -F  RX-N  RX  S 

IS  IH  N -F  RX-N  lY 

IRO-F  RX-R  IH  NX  j 

G 

PERCOLR 

(NO  SRNPLES) 

ORGRNIST 

fingering 

IP  ER-G  RX-L  RX 

IRO  R -G  RX-N  RX  S T 

IF  IH  NX-G  RX-R  IH  NX 

K 

vocniisT 

RLCHEHISr 

BaiCOHY 

BBKERY  1 

IV  OU-K  RX-L  RX  S T 

IRE  L -K  RX-N  RX  S T 

IB  RE  L -E  RX-N  lY 

IB  EY-K  RX-R  lY  1 

n 

NORnnLLV 

RRNRNENT 

RUNINRNT 

RDNIRRBLE  1 

IN  flO  R -n  RX-L  lY 

IRR  R -N  RX-N  RX  H T 

IR  UU-N  RX-N  RX  N T 

IRE  O-N  RX-R  RX-B  EL  | 

p 

eimeusu 

(NO  SRNPLES) 

TINPRNI 

RSPIRIN 

IP  ER-P  RX-L  IH  SH 

IT  IH  N -P  RX-H  lY 

IRE  S -P  RX-R  RX  N ] 

s 

UtSULU 

PRQXINRTE 

COHSONRHCE 

NENSURRBLE  j 

IIH  N -S  RX-L  RX  N 

IP  R RR  E-S  RK-N  RX  T 

IK  RR  H-S  RX-N  RX  N S 

IN  EH  N-S  RX-R  RX-B  EL 

TH 

(NO  SRNPLES) 

(NO  SRNPLES) 

(NO  SRNPLES) 

lutherrn 

IL  UN-TH  RX-R  RX  N 1 

T 

(NO  SnnPLES) 

ESTINRBLE 

DESTINY 

nystehy  j 

lEH  8-T  RX-N  RX-B  EL 

10  EH  S -T  RX-N  lY 

in  IH  S -T  RX-R  lY 

V 

JOCULRR 

OOCUNENT 

RLIENRBLE 

RUXlLIRRY 

10  ZH  RR  X-Y  RX-L  ER 

10  RR  E-Y  RX-N  RX  N T 

lEY  L-Y  RX-N  RX-B  EL 

no  C-IZ  IH  L-Y  RX-R  lY  1 

z 

CRUSRLLV 

(NO  SRNPLES) 

(NO  SRNPLES) 

ROSRRY  J 

IK  RO-Z  RX-L  lY 

IR  OH-Z  RX-R  lY  i 

Appendix  E:  Training  and  Test  Utterances 

Training Utterances -- 174 Utterances

Training Set LAA -- 20 utterances



PLEASE  HELP  ME 

WHAT  SHOULD  I ASK 

WHAT  CAN  THE  SYSTEM  DO 

THE  FIRST  TWO 

GIVE  ME  ONE  MORE  PLEASE 

THANK  YOU  I'M  DONE 

STOP  TRANSMITTING  PLEASE 

WHO WROTE IT

WHO  WAS  THE  AUTHOR 

WHAT  WAS  ITS  TITLE 

WHEN WAS IT PUBLISHED

WHAT  ABOUT  MINSKY 

WHICH  IS  THE  OLDEST 

WHAT  FACTS  ARE  STORED 

PLEASE  LIST  THE  AUTHORS 

PRINT  THE  NEXT  ONE 

WHERE  DOES  HE  WORK 

WHAT  IS  HER  AFFILIATION 

WHAT  ABOUT  FORMAL  SEMANTICS 

WHAT  ABOUT  PROGRAM  VERIFICATION 

Training Set LAB -- 20 utterances

ARE  ANY  ARTICLES  BY  REDDY 
WHAT  HAS  DREYFUS  WRITTEN  LATELY 
LIST  THE  ABSTRACTS  BY  NEWELL  OR  SIMON 
DO  ANY  PAPERS  CITE  NILSSON 
DO  MANY  ABSTRACTS  DISCUSS  SYNTAX 
HOW  MANY  PAPERS  REFER  TO  FRAME  THEORY 
WHERE  IS  PREDICATE  CALCULUS  MENTIONED 
ARE  NEURAL  NETWORKS  MENTIONED  ANYWHERE 
DO  ANY  OF  THESE  MENTION  PSYCHOLOGY 
IS  HEURISTIC  PROGRAMMING  MENTIONED 
WHO HAS WRITTEN ABOUT PATTERN MATCHING
WHEN WAS THAT BOOK WRITTEN
GIVE  ME  THE  DATE  OF  THAT  ABSTRACT 
WHAT  IS  THE  TITLE  OF  THAT  PAPER 
WHAT  IS  THE  SIZE  OF  THE  DATA  BANK 
WHAT  ADDRESS  IS  GIVEN  FOR  THE  AUTHORS 
GIVE  THE  AUTHOR  AND  DATE  OF  EACH 
HOW  MANY  REFERENCES  ARE  GIVEN 
PLEASE  MAKE  ME  A FILE  OF  THOSE 
CAN  I HAVE  THESE  ABSTRACTS  LISTED 



Training Set LAC -- 20 utterances




DID  ANY  AI  JOURNAL  PAPERS  CITE  WOODS 
ARE ANY BY UHR

THE  AREA  I'M  INTERESTED  IN  IS  UNDERSTANDING 
WHAT  ARE  SOME  OF  THE  AREAS  OF  ARTIFICIAL  INTELLIGENCE 
ARE  YOU  ALWAYS  THIS  SLOW 
WHAT  CAN  I DO  TO  SPEED  YOU  UP 

AREN'T  THERE  ANY  ABSTRACTS  SINCE  NINETEEN  SEVENTY  FIVE 

LET’S  RESTRICT  OUR  ATTENTION  TO  PAPERS  SINCE  NINETEEN  SEVENTY  FOUR 

WHAT  SORTS  OF  RECOGNITION  DEVICES  ARE  WRITTEN  UP 

DO  ANY  OF  THESE  ALSO  MENTION  PATTERN  RECOGNITION 

DOES  PATTERN  DIRECTED  FUNCTION  INVOCATION  GET  MENTIONED  ANYWHERE 

IS  RESOLUTION  THEOREM  PROVING  MENTIONED  IN  AN  ABSTRACT 

HOW  MANY  OF  THESE  ALSO  DISCUSS  ABSTRACTION 

ANY  ABSTRACTS  REFERRING  TO  DYNAMIC  CLUSTERING 

WHEN  WAS  CELL  ASSEMBLY  THEORY  LAST  REFERRED  TO 

WHICH  COGNITIVE  PSYCHOLOGY  CONTAINS  WINOGRAD'S  ARTICLE 

DON'T GET ME ANY ARTICLES WHICH MENTION GAME PLAYING

DOES  THAT  ARTICLE  MENTION  TIME  OR  SPACE  BOUNDS 

WHICH  PAPERS  ON  LANGUAGE  UNDERSTANDING  ARE  ABOUT  ENGLISH 

WHICH  PAPERS  ON  CONTROL  ALSO  DISCUSS  GRAIN  OF  COMPUTATION 

Training Set LAD -- 34 utterances

DO  ANY  PAPERS  CITE  NILSSON 

HAVE  ANY  NEW  PAPERS  BY  NEWELL  APPEARED 

DO  YOU  HAVE  ANY  NEW  PAPERS  ON  SPEECH  UNDERSTANDING 

GIVE  ME  THE  DATE  OF  THAT  ABSTRACT 

HOW  MANY  PAPERS  REFER  TO  FRAME  THEORY 

I AM  INTERESTED  IN  LANGUAGE  UNDERSTANDING 

DO  MANY  ABSTRACTS  DISCUSS  SYNTAX 

WHAT  IS  THE  TITLE  OF  THAT  PAPER 

WHO  WROTE  IT 

IS  HEURISTIC  PROGRAMMING  MENTIONED 
LIST  THE  ABSTRACTS  BY  NEWELL  OR  SIMON 
GIVE  THE  AUTHOR  AND  DATE  OF  EACH 
THE  FIRST  TWO 
PRINT  THE  NEXT  ONE 

HOW  MANY  PAPERS  DISCUSS  HILL  CLIMBING 
DO ANY OF THESE ALSO MENTION PATTERN RECOGNITION
ARE  ANY  BY  UHR 
WHAT  ABOUT  MINSKY 

WHAT  IS  THE  TITLE  OF  THE  MOST  RECENT  ONE 
WHICH  ARTICLES  REFER  TO  THESE 
ARE  ANY  ARTICLES  BY  REDDY 
WHICH  IS  THE  OLDEST 

HOW  MANY  ABSTRACTS  ARE  THERE  ON  PROBLEM  SOLVING 

DO  ANY  PAPERS  DISCUSS  PLANNER-LIKE  LANGUAGES 

ARE  THERE  ANY  RECENT  ARTICLES  IN  CACM 

WHERE  DID  THAT  ARTICLE  APPEAR 

WHEN  WAS  IT  PUBLISHED 

WHAT  HAS  DREYFUS  WRITTEN  LATELY 

WHO  HAS  WRITTEN  ABOUT  PATTERN  MATCHING 

IS  THERE  ANYTHING  NEW  REGARDING  SEMANTIC  NETS 

WHAT  ABOUT  FORMAL  SEMANTICS 

HAVE  ANY  ARTICLES  APPEARED  WHICH  MENTION  HEARSAY 

WHAT  ARE  THE  TITLES  OF  THE  RECENT  ARPA  SURNOTES 

HOW  MANY  ARTICLES  ON  PRODUCTION  SYSTEMS  ARE  THERE 




Training Set LMA -- 20 utterances


WHICH SUMMARIES ON AI CONSIDER PATTERN RECOGNITION IN ADDITION

WHAT  ARE  THEIR  AFFILIATIONS 

WHAT  ADDRESSES  ARE  GIVEN  FOR  THE  AUTHORS 

WHAT  ISSUES  DURING  JANUARY  AND  JULY  CONCERN  CONTROL 

LET  US  CONFINE  OURSELVES  TO  JOURNALS  AFTER  FEBRUARY  NINETEEN  FIFTY 

TELL  ME  THE  TITLES  OF  THE  EARLIEST  TEN 

CHOOSE  AMONG  VOLUMES  BEFORE  NINETEEN  SIXTY 

WHICH OF THESE APPEARED RECENTLY IN THE IEEE TRANSACTIONS

HOW  MANY  BOOKS  WERE  PRODUCED  FROM  MARCH  TO  DECEMBER 

HOW  BIG  IS  THE  DATA  BASE 

QUIT  LISTING  PLEASE 

CEASE  PRINTING 

LIST  THE  NEXT  FOURTEEN  HUNDRED 

IS  THERE  AN  IFIP  CONVENTION  ISSUE  FROM  MAY  OR  JUNE 

I DEMAND  ANOTHER  ARTICLE  AFTER  AUGUST  NINETEEN  THIRTEEN 

DID THE SIGART NEWSLETTER PUBLISH ANYTHING IN OCTOBER OR NOVEMBER

WE  WANT  SOME  REVIEWS  CONCERNING  PERCEPTRONS 

DID NEWELL PRESENT A PAPER AT THE IFIP MEETINGS IN SEPTEMBER

HOW  MANY  PAPERS  FROM  APRIL  THROUGH  AUGUST  CONCERNED  CHESS 

WE'D LIKE TO SEE THE TITLES FROM PROCEEDINGS OF THE ACM CONFERENCE


Training Set LMB -- 20 utterances
GENERATE  A COPY  OF  THOSE 

WE  DESIRE  A PROCEEDING  OF  THE  ACM  MEETING  REFERENCED  BY  NEWELL 

COULD  YOU  RETRIEVE  SOMETHING  FROM  INFORMATION  AND  CONTROL  DISCUSSING  AI 

DID  ANYONE  PUBLISH  ABOUT  LEARNING  IN  COMMUNICATIONS  OF  THE  ACM 

FINISH  PRINTING 

WHAT  ARE  THE  KEY  PHRASES 

I’D  LIKE  TO  SEE  THE  MENUS 

HAVEN’T  YOU  FINISHED 

WHICH  STORIES  IN  THE  SIGART  NEWSLETTER  HAVE  BEEN  DISCUSSING  CONTROL 
TRANSMIT  THE  NEXT  EIGHTEEN 

HASN'T LANGUAGE UNDERSTANDING BEEN CONSIDERED IN COMPUTING REVIEWS

LET  ME  LIMIT  MYSELF  TO  REPORTS  ISSUED  SINCE  NINETEEN  FIFTEEN 

HASN'T  A CURRENT  REPORT  ON  SPEECH  UNDERSTANDING  BEEN  RELEASED 

DIDN’T  THAT  PAPER  QUOTE  NILSSON 

ARE  NOT  SOME  OF  THESE  FROM  COMPUTING  SURVEYS 

DOESN’T  THIS  PAPER  REFERENCE  AN  IEEE  TRANSACTION 

WHY  IS  THE  SYSTEM  SO  SLOW 

KILL  THE  LISTING 

WHAT  KINDS  OF  SUBJECTS  ARE  STORED 
WHICH  SORT  OF  RETRIEVAL  KEYS  CAN  I SEEK 


Training Set LMC -- 20 utterances


SELECT  FROM  ARTICLES  ON  LANGUAGE  UNDERSTANDING 
SUBSELECT FROM GAME PLAYING
WHAT  SUBJECT  CAN  I REQUEST 

WE  WISH  TO  GET  THE  LATEST  FORTY  ARTICLES  ON  ASSOCIATIVE  MEMORIES 

GIVE  ME  SOMETHING  MENTIONING  ABSTRACTION 

PLEASE  TERMINATE  TRANSMITTING 

WHAT  SORT  OF  SUMMARY  IS  AVAILABLE 

I'D LIKE TO KNOW THE PUBLISHERS OF THAT STORY

SHOW  ME  ITS  PUBLISHER 

WHAT TOPIC MENU CAN I CHOOSE

SHOW  ME  THE  LATEST  ELEVEN 




ARE ANY OF THESE FROM THE IFIP SESSIONS IN THE MONTH OF JUNE

THE  LATEST  SIXTEEN  PLEASE 

DURING  WHAT  MONTHS  WERE  THEY  PUBLISHED 

WHO  WAS  QUOTED  IN  THAT  ARTICLE 

PRODUCE  A COPY  OF  THE  NEWEST  EIGHTY  ARTICLES 

DO  ANY  RECENT  ACM  CONFERENCES  CONSIDER  PSYCHOLOGY 

DID  ANY  IEEE  CONVENTIONS  PUBLISH  PROCEEDINGS 

WAS NEWELL CITED BY ANY REPORTS ISSUED IN THE LAST NINETY YEARS

TRY  TO  GET  SURVEYS  PRINTED  IN  THE  LAST  EIGHTY  MONTHS 

Training Set LLA -- 20 utterances

DO ANY PAPERS CITE MICHAEL ARBIB

HAVE  ANY  NEW  PAPERS  BY  ISSAC  ASIMOV  APPEARED 

DO  YOU  HAVE  NEW  PAPERS  ON  ACQUISITION  OF  KNOWLEDGE 

HOW  MANY  PAPERS  REFER  TO  ACTIVE  KNOWLEDGE 

I AM  INTERESTED  IN  ADAPTATION 

DO MANY ABSTRACTS DISCUSS AN ADAPTIVE NATURAL LANGUAGE SYSTEM

IS  ALGEBRAIC  REDUCTION  MENTIONED 

LIST  THE  ABSTRACTS  BY  RAJ  REDDY  OR  HARRY  BARROW 

HOW  MANY  PAPERS  DISCUSS  ALGOL 

DO ANY OF THESE ALSO MENTION AUTOMATIC CODING

ARE  ANY  BY  HANS  BERLINER 

WHAT  ABOUT  DANNY  BOBROW 

ARE  ANY  ARTICLES  BY  BRUCE  BUCHANAN 

HOW  MANY  ABSTRACTS  ARE  THERE  ON  ADAPTIVE  PRODUCTION  SYSTEMS 

DO  ANY  PAPERS  DISCUSS  ADVISING  PHYSICIANS 

WHAT  HAS  HERB  SIMON  WRITTEN  LATELY 

WHO HAS WRITTEN ABOUT ALGORITHMIC AESTHETICS

IS  THERE  ANYTHING  NEW  REGARDING  ALL-OR-NONE  SOLUTIONS 

WHAT  ABOUT  ANALOGY  IN  PROBLEM  SOLVING 

HOW MANY ARTICLES ON ANALYSIS OF CONTEXT ARE THERE


Testing  Utterances  — 105  Utterances 

Noah’s  performance  for  each  utterance  is  given  for  the  1000-word  vocabulary. 
A dashed  line  indicates  that  the  word  was  not  hypothesized;  a number  after  a word 
gives  the  rank  of  the  hypothesis. 
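
As an illustration of how this notation can be read mechanically, the following Python sketch aligns one result line with its reference utterance and reports which words were hypothesized and at what rank. The function names, the greedy left-to-right alignment, and the two summary figures are assumptions made for this sketch; they are not measures defined in the thesis.

```python
DASHES = {"-", "--", "\u2014"}  # tokens that mark words not hypothesized


def parse_hypotheses(line):
    """Turn 'WORD rank WORD rank ...' into [(word, rank), ...], skipping dashes."""
    tokens = [t for t in line.split() if t not in DASHES]
    return [(word, int(rank)) for word, rank in zip(tokens[0::2], tokens[1::2])]


def align(reference, line):
    """Pair each reference word with its rank, or None if it was not hypothesized.

    Greedy left-to-right matching: assumes the hypothesized words appear in
    utterance order, so a repeated word is credited to its first occurrence.
    """
    hyps = parse_hypotheses(line)
    j = 0
    aligned = []
    for word in reference.split():
        if j < len(hyps) and hyps[j][0] == word:
            aligned.append((word, hyps[j][1]))
            j += 1
        else:
            aligned.append((word, None))
    return aligned


def summarize(aligned):
    """Return (fraction of words hypothesized, mean rank of those words)."""
    ranks = [r for _, r in aligned if r is not None]
    hit_rate = len(ranks) / len(aligned)
    mean_rank = sum(ranks) / len(ranks) if ranks else None
    return hit_rate, mean_rank


aligned = align("WHICH PAPERS CITE AZRIEL ROSENFELD", "WHICH 1 PAPERS 1 CITE 2")
print(aligned)
print(summarize(aligned))
```

For the example utterance, three of the five words are hypothesized; the two names have no entry on the result line and are therefore reported with a rank of None.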

Test Set LRF -- 25 utterances

1.  (WHAT PAPERS ON GRAMMATICAL INFERENCE ARE THERE )

WHAT « PAPERS 1 -- GRAMMATICAL 1 INFERENCEia AREll THERE 2

2.  (ANY ABSTRACTS REFERRING TO DYNAMIC CLUSTERING )

ANY 1 ABSTRACTS 1 REFERRING 1 TO 3 DYNAMIC 1

3.  (WHICH PAPERS CITE FEIGENBAUM AND FELDMAN )

WHICH 1 PAPERS 1 CITE 2 AND 7

4.  (ARE THERE ANY NEW PAPERS ON GRAPH MATCHING )

ARE 12 THERE 2 ANY 1 NEW S PAPERS 1 ON 3 MATCHING S

5.  (IS RESOLUTION THEOREM PROVING MENTIONED IN AN ABSTRACT )

IS 1 MENTIONED 2 IN 2 AN 6 ABSTRACT S

6.  (GET ME EVERYTHING ON UNIFORM PROOF PROCEDURES )

ME 1 UNIFORM 7 PROOF 2 PROCEDURES 1

7.  (NO MORE PLEASE )

NO 1 MORE IS PLEASE 2

8.  (GIVE ME ONE MORE PLEASE )

GIVEia ME 1 ONE 3 MORE 1 PLEASE 1

9.  (WHO WROTE PAPERS ON PRODUCTION SYSTEMS THIS YEAR )

WHO 2 WROTE 1 PAPERS 1 ON 1 PRODUCTION 1 THIS 3

10.  (DID ANY AI JOURNAL PAPERS CITE WOODS )

ANY i AI 1 JOURNAL 1 PAPERS 3 CITE 1


(00  ALL  QUERIES 
— ALL  1 


(ARE  YOU 
ARE  1 


ALUAVS 


TAKE  THIS  LONG 

TAKE  3 THISia  LONG  9 

THIS  SLOU  ) 

THIS  2 SLOH  3 

IT  TAKE  ) 


— DOES  1 IT  1 TAKE  2 


ANSHER 


UHEH  3 HILL  3 — HAVE  8 — RNSItCR  5 


IT  AL'nhVS  TAKE 

IT  2 RLURVS  3 

RESPONSES  EVER  Cl 


E THIS  LONG 

- THIS  4 

CORE  FASTER 


TO  RN8UER  HE 
TO  4 RNSHER  1 HE  2 


DO  4 RESPONSES  7 CONE  1 FRSTER12 


HHAT  2 CRN  1 I 3 00  1 TO  1 SPEED  1 YOU  6 UP  4 
(HON  CRN  I USE  THE  SYSTEH  EFFICIENTLY 
CAN  2 I 6 USE  3 THE  2 SYSTEH  2 


I 1 ASK  2 


(UHAT  DO  1 HRVE  T 

UHRT  1 ~ 14  HRVE  1 TO  1 00  1 

(HELP  ) 

HELP  3 

(CRN  YOU  HELP  HE 

CRN  1 YOU  2 HE  1 

(PLEASE  HELP  HE  ) 

PLEASE  1 HELP  3 HE  1 

(UHRT  SHOULD  1 ASK 

UHRT  5 SHOULD  312  

(UHEH  HAS  THE  LAST 


HOLLAND  PUBLISHED 


UHEH  7 HAS  3 THE  1 LAST  1 PRPER  1 BY  3 


Test Set LHN -- 20 utterances

1.  (ANY  ABSTRACTS  REFERRING  TO  AI  OR  Al 

ANY  2 ABSTRACTS  7 REFERRING  I TO  9 AI  1 OR  3 > 

2.  (ARE  ASSOCIATIVE  HEHORIES  DISCUSSED  IN 

ARE  2 DISCUSSED  3 IN 

3.  (ARE  LEARNING  AND  NEURAL  NETUORKS  HEN 

ARE  2 LEARNING  S AND  S NETUORKS  1 DEN 

4.  (DID  REDDY  PRESENT  A PRPER  AT  IJCRI 

DIO  B REDDY  7 A I PAPER  I - 

5.  (DIDN'T  THAT  PRPER  QUOTE  DREYFUS  ) 

THAT  2 PRPER  1 DREYFUS  1 

6.  (DOES  PICTURE  RECOGNITION  GET  HENTIONED 


0 AI  OR  ARTIFICIAL  INTELLIGENCE 

0 9 AI  1 OR  3 

DISCUSSED  IN  RECENT  JOURNALS  > 
DISCUSSED  3 IN  4 RECENT  3 JOURNALS  1 
NETUORKS  HENTIONED  RNYUHERE  ) 
NETUORKS  1 HENTIONED  2 RNYUHERE  2 
ER  AT  IJCRI  ) 


RNYUHERE 


DOES  S PICTURE  1 RECOGNITION  1 GET  2 HENTIONED  2 RNYUHERE  2 
(GET  HE  EVERYTHING  ON  DYNRHIC  CLUSTERING  ) 


7.  (GET  HE  EVERYTHING  ON  DYNRHIC  CLUSTERING 

HE  1 ON  6 OYHAHIC  1 

8.  (GENERATE  A COPY  OF  THOSE  ) 

GENERATE  1 - COPY  S OF  2 TH0SEI3 

9.  (GIVE  HE  THE  DATE  OF  THAT  ABSTRACT  ) . 

CIVE19  HE  3 THE  3 DATE  S OF  2 THAT  1 ABSTRACT  1 
ID.  (HOU  CRN  I USE  THE  SYSTEH  EFFICIENTLY 

NON  1 CAN  1 I 8 USE  3 THE  1 SYSTEH  6 EFFICIENTLY  2 


11.  (I 

I 6 

12.  (1*0 


INTERESTED 


LEARNING 


INTERESTED  7 IN  3 LEARNING  3 


I'D  1 LIKE  1 TO  1 SEE  1 THE  1 HENUS  1 




13.  (SELECT FROM ARTICLES ON GAME PLAYING )

FROM 1 ARTICLES 1 ON 13 PLAYING 1

14.  (WHAT ADDRESSES ARE GIVEN FOR THE AUTHORS )

WHAT 1 GIVEN I FOR 3 THE 2

15.  (WHAT PAPERS ON PREFERENTIAL SEMANTICS ARE THERE )

WHAT 1 PAPERS 1 ON 4 SEMANTICS « ARE 8 THERE 3

16.  (WHEN WAS A SEMANTIC NETWORK LAST REFERRED TO )

WHEN 1 WAS 18 -- NETWORK 3 LAST 18 REFERRED 1 TO 1

17.  (WHICH PAPERS CITE FELDMAN )

WHICH 1 PAPERS 1 CITE 3

18.  (WHO HAS WRITTEN ABOUT AUTOMATIC PROGRAMMING )

WHO 4 -- ABOUT 1 PROGRAMMING 4

19.  (WHO WAS QUOTED IN THAT ARTICLE )

WHO 1 WAS 2 IN 4 THAT 14

20.  (WHICH IS THE OLDEST )

IS 3 THE S OLDEST S

Test Set LLB -- 20 utterances

1.  (DO ANY OF THESE MENTION ANALYSIS OF SENTENCES )

-- ANY 1 OF 2 THESE 1 MENTION 1 OF 1

2.  (WHICH AI TEXT CONTAINED THE ARTICLE BY ALLEN NEWELL )

WHICH 2 AI 1 CONTAINED 1 THE 6 ARTICLE 1 BY 2 NEWELL 1

3.  (WHAT TOPICS ARE RELATED TO AUTOMATIC PROGRAMMING )

WHAT 2 ARE 7 TO S

4.  (DOES ASSIMILATION OF NEW INFORMATION GET DISCUSSED ANYWHERE )

DOES 3 ASSIMILATION 1 OF 3 NEW 1 INFORMATION 1 GET 16 DISCUSSED 2

5.  (WHICH TITLES CONTAIN THE PHRASE AXIOMS FOR GO )

TITLES 1 CONTAIN 1 THE 1 PHRASE 1 FOR 1 --

6.  (DOES THAT ARTICLE MENTION AXIOMATIC SEMANTICS )

DOES 1 THAT 6 ARTICLE 1 MENTION 1 SEMANTICS 1

7.  (WHICH OF THEM DISCUSSES AUTOMATED DEDUCTION )

WHICH 1 OF 4 DISCUSSES 1 AUTOMATED 1 DEDUCTION 1

8.  (ARE THERE ANY ABSTRACTS WHICH REFER TO PAPERS BY BILL WOODS )

-- THERE 4 ANY I ABSTRACTS 1 REFER I TO 7 PAPERS 1 BY 4

9.  (WHERE IS AUTOMATIC COMPUTATION AND CONTROL MENTIONED )

WHERE 4 IS 3 AUTOMATIC 1 AND 2 MENTIONED 2

10.  (WHAT ARE SOME OF THE AREAS OF COGNITIVE SCIENCE )

WHAT 2 ARE 4 SOME 1 OF 7 THE 18 AREAS 6 OF 1

11.  (ARE ANY ARTICLES ABOUT BIOMEDICINE )

ARE 2 ANY 2 ARTICLES 3 ABOUT 3

12.  (DO ANY OF THE ABSTRACTS MENTION AUGMENTED TRANSITION NETWORKS )

DO 3 ANY 2 OF 4 THE 3 MENTION 1 AUGMENTED 4 TRANSITION 1

13.  (HOW MANY OF THESE ALSO DISCUSS AUTOMATIC PROGRAM WRITING )

HOW 1 MANY 1 OF 3 ALSO 1 DISCUSS 1 PROGRAM 9 WRITING 2

14.  (WHICH PAPERS ON BELIEF SYSTEMS ARE ABOUT CAUSAL REASONING )

WHICH 1 PAPERS 1 ON 7 SYSTEMS 1 ARE 4 REASONING 2

15.  (DO ANY PAPERS ON AUTOMATIC PROOF OF CORRECTNESS EXIST )

DO 2 ANY 1 PAPERS 1 ON 4 AUTOMATIC 1 PROOF 1 OF 9

16.  (WHAT ABOUT AUTOMATIC PROGRAM SYNTHESIS FROM EXAMPLE PROBLEMS )

WHAT 1 ABOUT 1 -- FROM 2

17.  (I AM INTERESTED IN COGNITION )

I 2 -- INTERESTED 1 IN 3

18.  (THE AREA I AM INTERESTED IN IS AUTOMATION )

THE 18 AREA 1 I 2 -- INTERESTED 1 IN 3 IS 12 AUTOMATION 1

19.  (DON'T GET ME ANY ARTICLES WHICH MENTION BACKGAMMON )

GET 1 ME 7 ANY 2 ARTICLES 1 WHICH 2 MENTION 2 BACKGAMMON 1

20.  (I AM ONLY INTERESTED IN PAPERS ON BINDINGS )

I 2 -- ONLY 9 INTERESTED 1 IN 1 PAPERS 1 ON 4




Test Set LLC -- 20 utterances


1.  (DO ANY PAPERS THIS YEAR CITE JOHN HOLLAND )

-- ANY 1 PAPERS 1 THIS 3 YEAR 8 CITE 1 JOHN 4

2.  (WHAT PAPERS ON AUTOMATIC THEOREM PROVING ARE THERE )

WHAT 2 PAPERS 1 ON 8 AUTOMATIC 1 ARE 4

3.  (ANY ABSTRACTS REFERRING TO THE BERKELEY DEBATE )

ANY 1 ABSTRACTS 5 TO 1 THE 1

4.  (WHICH PAPERS CITE AZRIEL ROSENFELD )

WHICH 1 PAPERS 1 CITE 2

5.  (ARE THERE ANY NEW PAPERS ON BUSINESS PROBLEM SOLVING )

ARE 1 THERE 1 ANY 1 PAPERS 1 ON 6 BUSINESS 1 PROBLEM 1

6.  (IS THE BAY AREA CIRCLE MENTIONED IN AN ABSTRACT )

IS 1 THE 1 BAY B IN 1 AN 2 ABSTRACT 1

7.  (GET ME EVERYTHING ON CARTOGRAPHY )

GET 6 HEW ON 8

8.  (WHO WROTE PAPERS ON BRAIN THEORY THIS YEAR )

WHO 1 WROTE 1 PAPERS 1 ON 3 THIS 3 YEAR 4

9.  (DID ANY ACL PAPERS CITE TERRY WINOGRAD )

DID 2 ANY 1 PAPERS 1 CITE 2 WINOGRAD 7

10.  (WHEN WAS THE LAST PAPER BY MARVIN MINSKY PUBLISHED )

WHEN 2 WAS 1 LAST 4 PAPER 1 BY S PUBLISHED 1

11.  (WHEN WAS CIRCUIT ANALYSIS LAST REFERRED TO )

WHEN 1 WAS 4 CIRCUIT 1 LAST 1 REFERRED 1 TO 2

12.  (ARE CASE SYSTEMS MENTIONED ANYWHERE )

ARE 3 CASEIS SYSTEMS 1 MENTIONED 2 ANYWHERE 2

13.  (DO ANY OF THOSE PAPERS MENTION CHECKING PROOFS )

DO 2 ANY 1 OF 2 THOSE 8 PAPERS 1 MENTION 2 CHECKING 1

14.  (WHICH PAPERS ON CHESS PLAYING PROGRAMS ALSO DISCUSS COMMON SENSE )

WHICH 1 PAPERS 1 ON 8 CHESS 6 ALSO 1 DISCUSS 1 COMMON 1 SENSE 12

15.  (WHAT SORTS OF COGNITIVE ROBOTIC SYSTEMS ARE WRITTEN UP )

WHAT 1 OF 7 ROBOTIC 2 SYSTEMS 1 ARE 1 --

16.  (DO ANY AUTHORS DESCRIBE COMMON SENSE THEORY FORMATION )

DO 1 COMMON 1 SENSEIB FORMATION 1

17.  (WAS IT PUBLISHED BY THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS )

WAS 2 IT 2 BY 2 -- ASSOCIATION 2 FOR 1 LINGUISTICS 1

18.  (IS THAT ABOUT COMPLEX WAVEFORMS )

IS 3 THAT 1 ABOUT 1 COMPLEX 1 WAVEFORMSIS

19.  (WHICH PAPER MENTIONS AN ASSEMBLY ROBOT )

WHICH 1 PAPER 1 MENTIONS 1 AN 1 ASSEMBLY 9 ROBOT 2

20.  (IS AN AXIOMATIC SYSTEM REFERRED TO )

-- AN 2 SYSTEM 1 REFERRED 1 TO 9


Test Set LLD -- 20 utterances

1.  (DO ANY PAPERS CITE ED FEIGENBAUM )

DO 1 ANY 1 PAPERS 1 CITE 1 EDIB FEIGENBAUM 6

2.  (HAVE ANY NEW PAPERS BY JERRY FELDMAN APPEARED )

HAVE 1 ANY 1 NEW 1 PAPERS 1 BY 3 -- APPEARED 1

3.  (DO YOU HAVE NEW PAPERS ON A CRT MONITOR )

DO 8 HAVE 2 NEW 1 PAPERS 1 ON 9 A 2 CRT 1 MONITOR 1

4.  (HOW MANY PAPERS REFER TO A COMMON SENSE ALGORITHM )

HOW 1 MANY 1 PAPERS 2 REFER 1 -- AS COMMON 1 SENSE 14

5.  (I AM INTERESTED IN COMPUTATIONAL LINGUISTICS )

I 2 -- INTERESTED 1 IN 2 LINGUISTICS 13

6.  (DO MANY ABSTRACTS DISCUSS COMPUTER ART )

DO 4 MANY 3 ABSTRACTS 1 -- ART 12

7.  (IS COMPUTER MUSIC MENTIONED )

IS 1 COMPUTERia MENTIONED 3

8.  (LIST THE ABSTRACTS BY LEONARD UHR )

LIST 4 THE 2 ABSTRACTS 1 BY 1




9.  (HOW MANY PAPERS DISCUSS COMPUTER CONTROLLED MANIPULATORS )

HOW 4 MANY 1 PAPERS 2 DISCUSS 2 COMPUTER 2 CONTROLLED 6

10.  (DO ANY OF THESE ALSO MENTION COMPUTER GRAPHICS )

OF 2 THESE 1 ALSO 1 MENTION 1 GRAPHICS 1

11.  (ARE ANY BY NILS NILSSON )

AREll ANY 1 BY 1 NILSSON 7

12.  (WHAT ABOUT KEN COLBY )

WHAT 1 ABOUT 1 KEN 1 COLBY S

13.  (ARE ANY ARTICLES BY ALLEN COLLINS )

ARE 3 ANY 3 ARTICLES 4 BY 1

14.  (HOW MANY ABSTRACTS ARE THERE ON COMPUTER VISION )

HOW 1 MANY 3 ABSTRACTS 3 ARE 1 ON 14

15.  (DO ANY PAPERS DISCUSS COMPUTER BASED CONSULTATIONS )

DO 2 ANY 1 PAPERS 2 DISCUSS 1 COMPUTER 1 CONSULTATIONS 2

16.  (WHAT HAS LEE ERMAN WRITTEN LATELY )

WHAT I HAS 1 LEE 5 ERMAN I

17.  (WHO HAS WRITTEN ABOUT CONCEPTUAL DESCRIPTIONS )

18.  (IS THERE ANYTHING NEW REGARDING CONCEPTUAL INFERENCE )

IS I THERE 1 NEW 1 REGARDING 6 INFERENCE 3

19.  (WHAT ABOUT CONCEPTUAL OVERLAYS )

WHAT 1 ABOUT 2

20.  (HAVE ANY ARTICLES APPEARED WHICH MENTION CONSTRAINT SATISFACTION )

HAVE 1 ANY 2 ARTICLES 2 WHICH 1 SATISFACTION 1




